Eirinaki
NEW APPROACHES
TO WEB PERSONALIZATION
Ph.D. THESIS
May 2006
ACKNOWLEDGEMENTS
"...And if you find her poor, Ithaka won't have fooled you. Wise as you will have
become, so full of experience, you will have understood by then what these Ithakas
mean." Constantine P. Cavafis (1863-1933)
There are many people that I need to thank for making this long journey so
memorable. First of all, I thank my advisor, Michalis Vazirgiannis, for believing in me
and supporting me all these years, providing me with valuable advice, and giving me the
opportunity to travel to several places and meet very interesting people during project
meetings or conferences.
I would also like to thank the members of my PhD examination committee, namely,
professors Ioannis Milis, Gerhard Weikum, Emmanouil Yakoumakis, Emmanouil
Yannakoudakis, Martha Sideri and Vassilis Vassalos.
I would like to extend my sincerest thanks to my collaborators during my PhD. First
of all, Iraklis Varlamis, for our fruitful discussions that constituted my first steps in
research. Also, Giorgos Tsatsaronis, Dimitris Kapogiannis, and especially Charalampos
Lampos and Stratos Pavlakis, who worked really hard as undergraduate students, offering
their excellent implementation skills, as well as valuable insights concerning our work. I
also thank Sarabjot S. Anand and Joannis Vlachakis, my collaborators during a European
project.
As a member of the DB-NET group, I had the chance to meet and befriend many
people. My thanks go to Maria Halkidi, Yannis Batistakis, Christos Pateritsas, Euripides
Vrachnos, Christoforos Ververidis, Christos Doulkeridis, Giorgos Tsatsaronis, Dimitris
Mavroeidis, our wonderful secretary Viky Sambani, and my good friends, Iraklis
Varlamis and Stratis Valavanis, for making these years fun and carefree, even during our
numerous moves, or our deadlines.
A friend is one who believes in you when you have ceased to believe in yourself.
There are many times during one's PhD when one wants to give up. My gratitude goes to
all my friends (thankfully too many to be mentioned individually), especially Elena
Avatagelou, Nikos Karelos, Matoula Kalyveza, my brother Pavlos Eirinakis and my very
best friend, Foteini Glykou, for being there for me.
I should thank the person that motivated me to become a computer scientist, my
uncle, professor Panagiotis Varelas. Throughout the years, he was always challenging me
with brain-teasing mathematical problems, introducing me to the fascinating world of
logic, algorithms, and, eventually, Informatics.
Special thanks to Alkis Polyzotis. He is the one who inspired and motivated me to
start this journey. His insights and advice during all these years enabled me to set higher
standards for my research. He has been a true friend and mentor, and I am very happy
that we have started a new journey together.
Finally, I come to the ones I thank the most for their constant love, support, and
encouragement, my parents Kyriaki and Pantelis Eirinakis. They believe in me and
always do everything in their power to let me pursue my dreams. I owe them everything
that I have accomplished to this day. This thesis is dedicated to them.
TABLE OF CONTENTS
LIST OF FIGURES ........................................................................................................ vii
LIST OF TABLES ........................................................................................................... ix
ABSTRACT...................................................................................................................... xi
1 Introduction.................................................................................................................... 1
1.2 Contributions........................................................................................................ 6
1.3 Thesis Outline ...................................................................................................... 9
2 Preliminaries & Related Work ................................................................................... 11
2.1 Usage Data Pre-processing ................................................................................ 11
2.2 Web Usage Mining and Personalization............................................................ 14
2.3 Integrating Content Semantics in Web Personalization..................................... 15
2.4 Integrating Structure in Web Personalization .................................................... 16
3 Semantic Web Personalization ................................................................................... 19
3.1 Motivating Example........................................................................................... 20
3.2 SEWeP System Architecture ............................................................................. 23
3.3 Similarity of Ontology Terms............................................................................ 25
3.3.2 THESUS Similarity Measure.................................................................... 26
3.4 Content Characterization ................................................................................... 26
3.4.1 Keyword Extraction .................................................................................. 27
3.4.2 Keyword Translation ................................................................................ 28
3.4.3 Semantic Characterization ........................................................................ 30
3.5 C-Logs Creation & Mining................................................................................ 32
3.6 Document Clustering ......................................................................................... 32
3.7 Recommendation Engine ................................................................................... 33
3.7.1 Semantic Recommendations..................................................................... 34
LIST OF FIGURES
Figure 1. The web personalization process......................................................................... 3
Figure 2. SEWeP architecture........................................................................................... 24
Figure 3. The keyword translation procedure ................................................................... 30
Figure 4. The semantic characterization process .............................................................. 31
Figure 5. The semantic recommendation method............................................................. 35
Figure 6. The category-based recommendation method .................................................. 36
Figure 7. Experiment #1: Recommendation sets evaluation ........................................... 39
Figure 8. Experiment #2: Original vs. Hybrid Recommendations................................... 40
Figure 9. Experiment #3: Semantic vs. Hybrid Recommendations .................................. 41
Figure 10. Experiment #4: Category-based vs. Hybrid Recommendations ..................... 41
Figure 11. SEWeP screenshot: The Logs Preprocessing module ..................................... 44
Figure 12. SEWeP screenshot: the Session Management module.................................... 45
Figure 13. SEWeP screenshot: the Semantic Association Rules Mining module ............ 45
Figure 14. The IKUM system architecture ....................................................................... 46
Figure 15. The Greek Web Archiving system architecture .............................................. 48
Figure 16. PageRank-based example................................................................................ 53
Figure 17. Usage-based PageRank (UPR) example ......................................................... 53
Figure 18. NG Creation Algorithm ................................................................................... 56
Figure 19. Navigational Graph ......................................................................................... 57
Figure 20. NG synopsis (Markov Chain) .......................................................................... 59
Figure 21. prNG of Markov Chain NG synopsis .............................................................. 66
Figure 22. prNG of 2nd order Markov model NG synopsis ............................................. 66
Figure 23. Construction of prNG ...................................................................................... 67
Figure 24. Path expansion subroutine............................................................................... 67
Figure 25. Average OSim and KSim of top-n rankings for msnbc data set....................... 76
Figure 26. Average OSim and KSim of top-n rankings for cti data set............................. 76
Figure 27. OSim for msnbc data set, Markov Chain NG synopsis ................................... 78
Figure 28. KSim for msnbc data set, Markov Chain NG synopsis ................................... 78
Figure 29. OSim for cti data set, Markov Chain NG synopsis.......................................... 79
Figure 30. KSim for cti data set, Markov Chain NG synopsis.......................................... 79
Figure 31. OSim for msnbc data set, 2nd-order Markov model NG Synopsis ................... 81
Figure 32. KSim for msnbc data set, 2nd-order Markov model NG Synopsis ................... 81
Figure 33. Comparison of l-UPR and h-PPM, Markov Chain NG synopsis .................... 83
Figure 34. The Prior Probabilities Computation module.................................................. 85
Figure 35. The Path Probabilities Computation module................................................... 86
Figure 36. The l-UPR Path Prediction module ................................................................. 86
LIST OF TABLES
Table 1: Related Work ...................................................................................................... 18
Table 2. URIs and related concept hierarchy terms.......................................................... 22
Table 3. User Sessions ...................................................................................................... 56
Table 4. Path Frequencies ................................................................................................ 59
Table 5. Top-10 Frequent Paths...................................................................................... 105
Table 6. Top-10 ranking for Start setup.......................................................................... 106
Table 7. Top-10 ranking for Total setup......................................................................... 106
ABSTRACT
The impact of the World Wide Web as a main source of information acquisition is
increasing dramatically. The existence of such an abundance of information, in combination
with the dynamic and heterogeneous nature of the web, makes web site exploration a
difficult process for the average user. To address the requirement of effective web
navigation, web sites provide personalized recommendations to the end users. Most of the
research efforts in web personalization correspond to the evolution of extensive research
in web usage mining, i.e., the exploitation of the navigational patterns of the web site's
visitors. When a personalization system relies solely on usage-based results, however,
valuable information conceptually related to what is finally recommended may be
missed. Moreover, the structural properties of the web site are often disregarded.
In this thesis, we propose novel techniques that use the content semantics and the
structural properties of a web site in order to improve the effectiveness of web
personalization. In the first part of our work we present SEWeP (standing for SEmantic
Web Personalization), a personalization system that integrates usage data with content
semantics, expressed in ontology terms, in order to compute semantically enhanced
navigational patterns and effectively generate useful recommendations. To the best of our
knowledge, SEWeP is the only semantic web personalization system that may be used by
non-semantic web sites.
In the second part of our work, we present a novel approach for enhancing the quality
of recommendations based on the underlying structure of a web site. We introduce UPR
(Usage-based PageRank), a PageRank-style algorithm that relies on the recorded usage
data and link analysis techniques. UPR is applied on an abstraction of the user sessions
termed Navigational Graph in order to determine the importance of a web page. We
develop l-UPR, a recommendation algorithm based on a localized variant of UPR that is
applied to the personalized navigational sub-graph of each user. Moreover, we integrate
UPR and its variations in a hybrid probabilistic predictive model as a robust mechanism
for determining prior probabilities of page visits. Overall, we demonstrate that our
CHAPTER 1
Introduction
During the past few years the World Wide Web has become the biggest and most
popular way of communication and information dissemination. It serves as a platform for
exchanging various kinds of information, ranging from research papers, and educational
content, to multimedia content, software and personal logs (blogs). Every day, the web
grows by roughly a million electronic pages, adding to the hundreds of millions of pages
already on-line. Because of its rapid and chaotic growth, the resulting network of
information lacks organization and structure. Users often feel disoriented and get lost
in that information overload that continues to expand. On the other hand, the e-business
sector is rapidly evolving and the need for web market places that anticipate the needs of
their customers is more than ever evident. Therefore, the key requirement nowadays is
to predict the users' needs in order to improve the usability and user retention of a web
site. This thesis presents novel methods and techniques that address this requirement. We
elaborate on the problems that motivated our work in Section 1.1, and we outline our
contribution in Section 1.2. An outline of this thesis is given in Section 1.3.
1.1 Motivation
Imagine a user that navigates through the pages of a web portal, specializing in sports.
We will refer to it (hypothetically) as the Sportal, residing on the imaginary site
www.theSportal.com. This user is a fan of winter skiing and would like to visit a ski resort
for the holidays. He therefore searches to find any related information available, ranging
from winter resort hotels to weather reports and ski equipment. Since the amount of
information in the Sportal is very big, this information is not necessarily organized as a
single thematic module. Based on this users navigation, however, in combination with
previous users visits focusing on the same subject (winter ski vacation), the system
makes recommendations to the user.
Assume, for example, that many users in the past have seen the pages
www.theSportal.com/events/ski.html, www.theSportal.com/travel/ski_resorts.html, and
www.theSportal.com/equipment/ski_boots.html during the same visit. If the current user
visits the first two, the system can recommend the third one, based on the assumption that
people with similar interests present similar navigational behavior. Moreover, since the
current visitor seems to be interested in pages concerning the winter, ski, and resorts
thematic areas of the portal, the system may recommend other pages that are related to
these categories, such as a page about ski equipment that is available on sale.
The web personalization process is illustrated in Figure 1. Using the four
aforementioned sources of information as input to pattern discovery techniques, the
system tailors the provided content to the needs of each visitor of the web site. The
personalization process can result in the dynamic generation of recommendations, the
creation of index pages, the highlighting of existing hyperlinks, the publishing of targeted
advertisements or emails, etc. In this thesis we focus on personalization systems that aim
at providing personalized recommendations to the web site's visitors. Furthermore, since
the personalization algorithms we propose in this work are generic and applicable to any
web site, we assume that no explicit knowledge involving the users' profiles, such as
ratings or demographic information, is available.
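The co-visitation intuition behind the Sportal example can be sketched as a tiny session-based recommender. This is only an illustration of the idea, not one of the algorithms proposed in this thesis; the page names and sessions below are hypothetical:

```python
from collections import Counter

# Hypothetical past sessions from the web logs (sets of visited pages).
sessions = [
    {"/events/ski.html", "/travel/ski_resorts.html", "/equipment/ski_boots.html"},
    {"/events/ski.html", "/travel/ski_resorts.html", "/equipment/ski_boots.html"},
    {"/events/ski.html", "/news/football.html"},
]

def recommend(current_visit, sessions, top_n=1):
    """Recommend pages that co-occurred with the current visit in past sessions."""
    scores = Counter()
    for s in sessions:
        if current_visit <= s:  # the session contains every page seen so far
            for page in s - current_visit:
                scores[page] += 1
    return [page for page, _ in scores.most_common(top_n)]

print(recommend({"/events/ski.html", "/travel/ski_resorts.html"}, sessions))
# -> ['/equipment/ski_boots.html']
```

A user who has visited the first two ski pages is recommended the third, purely because past visitors viewed all three together.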
Relying solely on usage data for web personalization, however, presents certain shortcomings. This may happen when, for
instance, there is not enough usage data available in order to extract patterns related to
certain navigational actions, or when the web site's content changes and new pages are
added but are not yet included in the web logs. Moreover, taking into consideration the
temporal characteristics of the web in terms of its usage, such systems are very
sensitive to the training data used to construct the predictive model. As a result, a
number of research approaches integrate other sources of information, such as the web
content [AG03, DM02, EGP02, GKG05, JZM04b, JZM05, MD+00b, ML+04, MSR04,
OB+03, PE00] or the web structure [BL06, HLC05, NM03, ZHH02b] in order to enhance
the web personalization process.
As already implied, the users' navigation is largely driven by semantics. In other
words, in each visit, the user usually aims at finding information concerning a particular
subject. Therefore, the underlying content semantics should be a dominant factor in the
process of web personalization. The web site's content characterization process involves
extracting features from the web pages. Usually these features are keywords,
subsequently used to retrieve similarly characterized content. Several methods for
extracting keywords that characterize web content have been proposed [BP98, CD+99,
HG+02]. The similarity between documents is usually based on exact matching between
these terms. This way, however, only a binary matching between documents is achieved,
whereas no actual semantic similarity is taken into consideration. The need for a more
abstract representation that will enable a uniform and more flexible document matching
process calls for the use of semantic web structures, such as ontologies1 [BHS02,
HN+03]. By mapping the keywords to the concepts of an ontology, or topic hierarchy,
the problem of binary matching can be overcome through the use of the hierarchical
relationships and/or the semantic similarities among the ontology terms, and therefore,
the documents.
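As a concrete (if simplified) illustration of going beyond binary matching, a path-based measure such as Wu-Palmer similarity scores two terms by the depth of their lowest common subsumer in the hierarchy. The toy hierarchy and terms below are hypothetical, and this is not the specific similarity measure adopted later in this thesis:

```python
# Toy is-a concept hierarchy, child -> parent (hypothetical terms).
parent = {
    "ski": "winter_sports", "snowboard": "winter_sports",
    "winter_sports": "sports", "football": "sports", "sports": "root",
}

def ancestors(term):
    """Return the path from a term up to the root, term included."""
    path = [term]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def wu_palmer(a, b):
    """Wu-Palmer similarity: 2*depth(lcs) / (depth(a) + depth(b))."""
    pa, pb = ancestors(a), ancestors(b)
    lcs = next(t for t in pa if t in pb)      # lowest common subsumer
    depth = lambda t: len(ancestors(t)) - 1   # the root has depth 0
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(wu_palmer("ski", "snowboard"))  # siblings under winter_sports -> 0.666...
print(wu_palmer("ski", "football"))   # related only via sports -> 0.4
```

Exact keyword matching would score both pairs 0; the hierarchy lets "ski" and "snowboard" come out as far more related than "ski" and "football".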
Finally, we should take into consideration that the web is not just a collection of
documents browsed by its users. The web is a directed labeled graph, including a
plethora of hyperlinks that interconnect its web pages. Both the structural characteristics
of the web graph, as well as the underlying semantics of the web pages and hyperlinks,
are important and determinative factors in the users' navigational process. We briefly
discuss the most important research studies2 based on the aforementioned intuitions
below, while a more detailed overview of related work is given in Chapter 2.
1 In this work we focus on the hierarchical part of an ontology. Therefore, in the rest of
this work we use the terms concept hierarchy, taxonomy and ontology interchangeably.
Several research studies proposed frameworks that express the users' navigational
behavior in terms of an ontology and integrate this knowledge in semantic web sites
[OB+03], Markov model-based recommendation systems [AG03], or collaborative
filtering systems [DM02]. Overall, all the aforementioned approaches are based on the
same intuition: enhance the web personalization process with content semantics,
expressed using the terms of a domain-ontology. The extracted web content features are
mapped to ontology terms and this abstraction enables the generalizations/specializations
of the derived patterns and/or user profiles. In all proposed models, however, the
ontology-term mapping process is performed manually or semi-automatically (needing
the manual labeling of the training data set). As far as the content characterization
process is concerned, the features characterizing the web content are extracted from the
web page itself, ignoring semantics arising from the connectivity features of the web
[BP98, CD+98]. Some approaches are based on collaborative filtering systems, which
assume that some kind of user ratings are available, or on semantic web sites, which
assume that an existing underlying semantic annotation of the web content is available a
priori. Finally, none of the aforementioned approaches fully exploits the underlying
semantic similarities of terms belonging to an ontology, apart from the straightforward
is-a or parent-child hierarchical relationships.
As far as the exploitation of the connectivity features of the web graph is concerned,
even though they have been extensively used for personalizing web search results
[ANM04, H02, RD02, WC+02], only a few approaches exist for enhancing the web
recommendation process, either using the degree of link connectivity for switching
among different recommendation models [NM03] or using citation network analysis for
clustering related pages in a recommendation system based on Markov models
2 At this point, we focus on research studies that appeared prior, or in parallel, to our work.
[ZHH02b]. None of the aforementioned systems, however, exploits the notion of a web
page's importance in the web graph or fully integrates link analysis techniques in the
web personalization process.
1.2 Contributions
The main contribution of this thesis is a set of novel techniques and algorithms aimed
at improving the overall effectiveness of the web personalization process through the
integration of the content and the structure of the web site with the users' navigational
patterns.
In the first part of our work we present the semantic web personalization system
SEWeP that integrates usage data with content semantics in order to compute
semantically enhanced navigational patterns and effectively generate useful
recommendations. Similar to previously proposed approaches, the proposed
personalization framework uses ontology terms to annotate the web content and the
users' navigational patterns. The key departure from earlier approaches, however, is that
SEWeP is the only web personalization framework that employs automated keyword-to-ontology mapping techniques, while exploiting the underlying semantic similarities
between ontology terms. Apart from the novel recommendation algorithms we propose,
we also emphasize a hybrid structure-enhanced method for annotating web content.
To the best of our knowledge, SEWeP is the only semantic web personalization system
that can be used by any web site, given only its web usage logs and a domain-specific
ontology.
Our key contributions regarding this framework are:
association rules mining etc.) relying on the semantic similarity between web
documents.
Two recommendation algorithms which integrate web content semantics with the
users' navigational behavior. The web pages are characterized by a set of domain-ontology terms. This uniform characterization enables the categorization of the
web pages into semantically coherent clusters, as well as the semantic
enhancement of the web logs. These two enhanced sources of knowledge are then
used by the proposed methods to generate recommendations that are semantically
relevant to the current navigational behavior of each user. The first method
generates recommendations by expanding the association rules derived by mining
the web logs, using the most similar document cluster. The second method
generates a new type of association rules, named category-based association rules,
which are computed by mining the semantically enhanced logs (called C-logs)
and expanding the recommendation set based on the most similar document
cluster.
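A minimal sketch of the cluster-based expansion step shared by both methods follows. The clusters and pages are hypothetical, and simple set overlap stands in for the semantic similarity used by the actual algorithms, which are detailed in Chapter 3:

```python
# Hypothetical semantically coherent page clusters; overlap with the
# usage-derived recommendation set stands in for semantic similarity.
clusters = [
    {"/travel/ski_resorts.html", "/travel/chalets.html", "/weather/alps.html"},
    {"/news/football.html", "/news/basketball.html"},
]

def expand_with_cluster(recommended, clusters, max_extra=2):
    """Expand a usage-based recommendation set with pages from the
    most similar (here: most overlapping) document cluster."""
    best = max(clusters, key=lambda c: len(c & recommended))
    extra = sorted(best - recommended)[:max_extra]
    return recommended | set(extra)

recs = {"/travel/ski_resorts.html"}
print(sorted(expand_with_cluster(recs, clusters)))
# -> ['/travel/chalets.html', '/travel/ski_resorts.html', '/weather/alps.html']
```

The usage-mined recommendation is enriched with semantically related pages that may never have co-occurred with it in the logs.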
In the second part of our work, we encompass the notion of authority transfer, as
defined in the most popular link analysis algorithm, PageRank [BP98]. The underlying
assumption is that a web page is considered to be important (in other words, an
authority) if other important pages have links pointing to it. Authority pages thus
transfer some of their importance to the pages they link to, and so on. Motivated by the
fact that in the context of navigating a web site, a page/path is important if many users
have visited/followed it before, we propose a novel algorithm, named UPR, that assigns
importance rankings (and therefore visit probabilities) to the web site's pages. UPR
(Usage-based PageRank) is a PageRank-style algorithm that is applied on an abstraction
of the user sessions termed the Navigational Graph (NG). We specialize this
generalized personalization framework in two different contexts. We develop l-UPR,
a recommendation algorithm based on a localized variant of UPR that is applied to the
personalized navigational sub-graph of each user for providing fast, online
recommendations. Moreover, we integrate UPR and its variations in a hybrid
probabilistic predictive model (h-PPM) as a robust mechanism for determining prior
probabilities of page visits. To the best of our knowledge, this is the first integrated
solution addressing the problem of web personalization using a page ranking approach.
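The authority-transfer idea can be illustrated with a PageRank-style power iteration in which the probability of following a link is proportional to how often it was actually followed in the logs. This is only a sketch of the intuition, with hypothetical click counts, not the precise definition of UPR given later in the thesis:

```python
# Toy navigational graph: clicks[p][q] = how many recorded sessions
# followed the link p -> q (hypothetical usage data).
clicks = {
    "A": {"B": 1, "C": 9},
    "B": {"C": 10},
    "C": {"A": 10},
}

def usage_pagerank(clicks, d=0.85, iters=100):
    """PageRank-style iteration with usage-weighted transition probabilities."""
    pages = list(clicks)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / len(pages) for p in pages}
        for p, outs in clicks.items():
            total = sum(outs.values())
            for q, w in outs.items():
                # Importance flows in proportion to observed click-through.
                new[q] += d * rank[p] * w / total
        rank = new
    return rank

ranks = usage_pagerank(clicks)
print(max(ranks, key=ranks.get))  # "C" attracts the most usage-weighted authority
```

Plain PageRank would split A's authority evenly between B and C; weighting by recorded clicks shifts it toward the paths users actually take.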
More specifically, our key contributions are:
The application of UPR for extending and enhancing standard web usage mining
and personalization probabilistic models such as Markov models. We present a
hybrid probabilistic prediction framework (h-PPM) where UPR, as well as its
variations, are used for assigning prior probabilities to the nodes (pages) of any
Markov model based on the topology (structure) and the navigational patterns
(usage) of the web site.
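One way such ranking-derived priors might enter a first-order Markov predictor is linear interpolation between the maximum-likelihood transition estimates and a prior distribution standing in for UPR scores. All numbers below are hypothetical, and this smoothing scheme is only an illustration, not the h-PPM model itself:

```python
# Hypothetical prior visit probabilities (standing in for UPR scores)
# and observed transition counts from the logs.
prior = {"A": 0.45, "B": 0.09, "C": 0.46}
counts = {("A", "B"): 1, ("A", "C"): 9, ("B", "C"): 10, ("C", "A"): 10}

def next_page(current, lam=0.8):
    """Predict the next page by blending transition estimates with priors."""
    total = sum(c for (p, _), c in counts.items() if p == current)
    def score(q):
        trans = counts.get((current, q), 0) / total if total else 0.0
        return lam * trans + (1 - lam) * prior[q]
    return max(prior, key=score)

print(next_page("A"))  # -> C
```

When transition counts are sparse or missing for a page, the prior term keeps the prediction well-defined instead of collapsing to zero probabilities.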
CHAPTER 2
Preliminaries & Related Work
In this chapter we start by briefly presenting the data preprocessing issues that should
be taken into consideration prior to applying any web mining and personalization
techniques to the usage data. We then provide a review of related research efforts,
ranging from the earlier approaches that focus on web usage mining, to the ones focusing
on web personalization. We then present those that integrate content and/or structure data
in the web personalization process, emphasizing the research efforts (previous and
subsequent) that are most similar to our work3. We provide a summarized overview of all
related research efforts categorized by the web mining method employed and their
application area in Table 1. The areas covered by our work are depicted by highlighted
cells.
3 A more detailed overview of the related work, as well as references to related commercial products, can be found in [EV03, E04].
[W3Clog]. In general, the extended log format consists of a list of prefixes and
identifiers, some of which can be found in Table. The prefixes include c (client), s
(server), r (remote), cs (client to server), sc (server to client), sr (server to remote
server, used by proxies), rs (remote server to server, used by proxies), and x
(application-specific identifier). The identifiers include date, time, ip (the IP of the
client generating the page hit), bytes (the number of bytes transferred), cached (whether
a cache hit occurred), status (the status code returned by the web server), comment (the
comment returned with the status code), method (the method used to retrieve the data),
uri (the URI requested), uri-stem, and uri-query. Using a combination of some of these
prefixes and identifiers, additional information can be recorded, such as referrer (the
web page the client was visiting before requesting the current one), user_agent (the
software the client is using), or keyword (the keywords used when reaching that page
after a search engine query). Except for the web server logs, which are the main source
of information in the web usage mining and personalization processes, useful information
can be acquired from proxy server logs, browser logs, registration data, cookies, user
ratings etc. Since in this thesis we present a generic personalization framework which can
be applied on any web site, requiring only the anonymous usage data recorded in its web
usage logs, we do not elaborate on such data sources.
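For illustration, an entry in the closely related NCSA combined log format (one request per line, with referrer and user agent appended) can be parsed with a regular expression along these lines; the log line shown is fabricated:

```python
import re

# A fabricated entry in the NCSA combined log format.
line = ('192.0.2.7 - - [10/May/2006:13:55:36 +0300] '
        '"GET /travel/ski_resorts.html HTTP/1.1" 200 5120 '
        '"http://www.theSportal.com/events/ski.html" "Mozilla/5.0"')

LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

entry = LOG_RE.match(line).groupdict()
print(entry["uri"], entry["referrer"])
```

The referrer field is exactly what the path-completion and sessionization heuristics discussed below rely on.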
Prior to processing the usage data using web mining or personalization algorithms,
the information residing in the web logs should be preprocessed. The web log data preprocessing is an essential phase in the web usage mining and personalization process. An
extensive description of this process can be found in [CMS99]. In the sequel, we provide
a brief overview of the most important pre-processing techniques, providing in parallel
the related terminology.
The first issue in the pre-processing phase is data preparation. Depending on the
application, the web log data may need to be cleaned from entries involving page
accesses that returned, for example, an error or graphics file accesses. Furthermore,
crawler activity usually should be filtered out, because such entries do not provide useful
information about the site's usability. A very common problem to be dealt with has to do
with web page caching. When a web client accesses an already cached page, this access
is not recorded in the web site's log. Therefore, important information concerning web
path visits is missed. Caching is heavily dependent on the client-side technologies used
and therefore cannot be dealt with easily. In such cases, cached pages can usually be
inferred using the referring information from the logs and certain heuristics, in order to
re-construct the user paths, filling out the missing pages.
After all page accesses are identified, the pageview identification should be
performed. According to [WCA], a pageview is defined as "the visual rendering of a web
page in a specific environment at a specific point in time". In other words, a pageview
consists of several items, such as frames, text, graphics and scripts that construct a single
web page. Therefore, the pageview identification process involves the determination of
the distinct log file accesses that contribute to a single pageview. Again such a decision is
application-oriented.
In order to personalize a web site, the system should be able to distinguish between
different users or groups of users. This process is called user profiling. In case no other
information than what is recorded in the web logs is available, this process results in the
creation of aggregate, anonymous user profiles since it is not feasible to distinguish
among individual visitors. However, if user registration is required by the web site, the
information residing in the web log data can be combined with the users' demographic
data, as well as with their individual ratings or purchases. The final stage of
log data pre-processing is the partition of the web log into distinct user and server
sessions. A user session is defined as a delimited set of user clicks across one or more
web servers, whereas a server session, also called a visit, is defined as a collection of
user clicks to a single web server during a user session [WCA]. If no other means of
session identification, such as cookies or session ids is used, session identification is
performed using time heuristics, such as setting a minimum timeout and assuming that
consecutive accesses within it belong to the same session, or a maximum timeout,
assuming that two consecutive accesses that exceed it belong to different sessions. More
details on the user and session identification process can be found in [EV03].
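The maximum-timeout heuristic can be sketched in a few lines; the hits and the 30-minute threshold below are hypothetical (the thresholds used in practice vary):

```python
from datetime import datetime, timedelta

# Hypothetical cleaned log hits: (user, timestamp, uri), sorted by time.
hits = [
    ("192.0.2.7", datetime(2006, 5, 10, 13, 55), "/events/ski.html"),
    ("192.0.2.7", datetime(2006, 5, 10, 14, 5), "/travel/ski_resorts.html"),
    ("192.0.2.7", datetime(2006, 5, 10, 15, 30), "/news/football.html"),
]

def sessionize(hits, timeout=timedelta(minutes=30)):
    """Start a new session whenever the gap between a user's consecutive
    accesses exceeds the maximum timeout."""
    sessions, last_seen = {}, {}
    for user, ts, uri in hits:
        if user not in sessions or ts - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])
        sessions[user][-1].append(uri)
        last_seen[user] = ts
    return sessions

print(sessionize(hits)["192.0.2.7"])
# -> [['/events/ski.html', '/travel/ski_resorts.html'], ['/news/football.html']]
```

The 85-minute gap before the third hit exceeds the timeout, so that access opens a second session for the same visitor.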
semi-Markov process defined on this tree based on the observed user paths. In this
Markov model-based work, the semantic characterization of the content is performed
manually. Moreover, no semantic similarity measure is exploited for enhancing the
prediction process, except for generalizations/specializations of the ontology terms.
Finally, in a subsequent work, Middleton et al. [MSR04] explore the use of ontologies
in the user profiling process within collaborative filtering systems. This work focuses on
recommending academic research papers to academic staff of a University. The authors
represent the acquired user profiles using terms of a research paper ontology (is-a
hierarchy). Research papers are also classified using ontological classes. In this hybrid
recommender system which is based on collaborative and content-based recommendation
techniques, the content is characterized with ontology terms, using document classifiers
(therefore a manual labeling of the training set is needed) and the ontology is again used
for making generalizations/specializations of the user profiles.
between unconnected pairs for selecting candidates and making
recommendations. In this study the graph nodes represent both users and rated/purchased
items.
Finally, subsequent to our work, Borges and Levene [BL06] independently proposed
two link analysis ranking methods, SiteRank and PopularityRank, which are in essence
very similar to the proposed variations of our UPR algorithm (PR and SUPR
respectively). Their work focuses on comparing the distributions and the rankings
produced by the two methods rather than on proposing a web personalization algorithm.
The authors' concluding remark, that the topology of the web site is very important and
should be taken into consideration in the web personalization process, further supports our claim.
[Table: classification of related work by web mining technique (MM & Cl, WUM, AR & SP, AR & Cl, ML & CF, MM & CF, PM & CF, LA & MM) against the personalization input used (WUM/WP alone, or combined with Profile, Content, or Structure data), with representative citations: [CMS97, SC+00], [PE00], [ZHH02b], [MD+00b, MSR04*, DM02*], [HLC05], [ML+04, OB+03*], [CPY96], [BB+99, BS00, SFW99], [KJ+01, NC+03, YZ+96], [HN+01, ZXH98], [HF04], [AP+04, BS04], [SZ+97], [BL99, DK04, LL03, S00, ZHH02a], [ZB04, JZM04a], [MD+00a, NM02, NP03, SK+00, SK+01], [AG03*], [ADW02], [JZM04b], [HEK03, NP04], [EGP02], [BL06], [CH+00, YH03], [MPG03], [MPT99], [NM03], [GKG05], [JF+97], [KS04], [JZM05].]
CHAPTER 3
Semantic Web Personalization
Users' navigation in a web site is typically content-driven: users usually
search for information or services concerning a particular topic. Therefore, the underlying
content semantics should be a dominant factor in the process of web personalization. In
this thesis we present SEWeP (standing for Semantic Enhancement for Web
Personalization), a web personalization framework that integrates content semantics with
the users' navigational patterns, using ontologies to represent both the content and the
usage of the web site.
In our proposed framework we employ web content mining techniques to derive
semantics from the web site's pages. These semantics, expressed in ontology terms, are
used to create semantically enhanced web logs, called C-logs (concept logs).
Additionally, the site is organized into thematic document clusters. The C-logs and the
document clusters are in turn used as input to the web mining process, resulting in the
creation of a broader, semantically enhanced set of recommendations. The whole process
bridges the gap between the Semantic Web and Web Personalization areas, creating a
Semantic Web Personalization system. To the best of our knowledge, SEWeP is the only
system that provides an integrated solution for semantic web personalization and can be
used by any (semantic or not) web site, fully exploiting the underlying semantic
similarities of ontology terms. Parts of this chapter have appeared in [EVV03, EL+04].
In the Sections that follow we motivate the integration of content semantics in the
web personalization process using an illustrative example, and then present in more detail
the components of the SEWeP system. We conclude with an extensive experimental
evaluation of the system, as well as a brief description of system prototypes based (or
partly based) on the SEWeP framework.
www.theSportal.com/equipment/ski_boots.html.
One may easily interpret this pattern as: people who are interested in ski events and
search for winter vacations will probably be interested in purchasing ski boots. Under the
assumption that such a user is interested in finding a ski resort to spend her holidays, and
using pure usage-based personalization, the next time a user U navigates through
Sportal and visits the first two web pages, the personalized site will dynamically
recommend to U the page included in the right-hand side (RHS) of the rule.
Sportal's content, however, is continuously updated. Suppose that the ski
equipment department has just announced a sale on all ski boots:
www.theSportal.com/equipment/ski_boot_sale.html.
Since this is a new web page, it isn't included in the web logs, or appears in them with
very low frequency (no one, or only a few users, have visited this page), and is therefore
not included in the derived association rules comprising our navigational model. As a
consequence, if we follow the traditional usage-based personalization process, it will
never be recommended to U, even though it is apparently very similar to U's
search intentions.
Moreover, assume that Sportal also hosts another service, reporting the snow
conditions in several ski resorts, in the web page:
www.theSportal.com/weather/snowreport.html.
Again, the information residing in this page is very relevant to U's interests, but it is
not included in the derived association rules. This may occur, for example, if the web
administrator hasn't added a link from the ski-related pages to the weather page, so that
not many users have followed this path before.
As a third scenario, consider the case where U, instead of following the previous path,
visits the web pages
www.theSportal.com/sports/winter_sports/ski.html,
www.theSportal.com/travel/winter/hotels.html.
It is obvious that this visit is semantically similar to the previous one and that the
objective of the user is the same. The system, however, will not provide the same
recommendations to U, since it won't recognize this similarity. Moreover, in case these
two web pages are not included in an association rule in the knowledge base, for any of
the aforementioned reasons, the system will recommend nothing to U!
Based on the aforementioned example, it is evident that pure usage-based
personalization is problematic in several cases. We claim that information conceptually
related to the user's visit should not be missed, and introduce the SEWeP
personalization system, which addresses the aforementioned shortcomings by generating
semantically enhanced recommendations.
www.theSportal.com/sports/winter_sports/ski.html
www.theSportal.com/travel/ski_resorts.html
www.theSportal.com/travel/winter/hotels.html
www.theSportal.com/equipment/ski_boots.html
www.theSportal.com/equipment/ski_boot_sale.html
www.theSportal.com/weather/snowreport.html
Based on the semantic similarity between these terms, the respective web pages are
categorized into semantic clusters (since the terms are hierarchically correlated). SEWeP's
recommendation engine generates both URI-based association rules (as any usage-based
personalization system does) and category-based ones (e.g., {snow, winter, hotel} →
{travel, equipment}). These rules are then expanded to include documents that fall under
the most similar semantic cluster.
Returning to our scenario, assume that the user visits the web pages:
www.theSportal.com/events/ski.html, and
www.theSportal.com/travel/ski_resorts.html.
The system, based on the URI-based association rules derived from web log mining,
finds the most relevant rule and recommends its RHS to the user. This recommendation
set will be referred to as the original recommendations:
www.theSportal.com/events/ski.html,
www.theSportal.com/travel/ski_resorts.html
www.theSportal.com/equipment/ski_boots.html.
Moreover, it expands the recommendation set by including documents that belong to the
same thematic cluster as the proposed URI, generating semantic recommendations:
www.theSportal.com/equipment/ski_boot_sale.html
www.theSportal.com/weather/snowreport.html.
Assume now that another user navigates through the web site, visiting the web pages
www.theSportal.com/sports/winter_sports/ski.html,
www.theSportal.com/travel/winter/hotels.html.
Based on the derived URI-based association rules, a usage-based personalization system
would not find a matching association rule and wouldn't recommend anything. SEWeP,
however, based on the category-based association rules it generates, abstracts the user's
visit and matches it with the category-based rule:
{ski, winter, travel} → {snow, equipment}
It then recommends documents that belong to the cluster which is characterized by the
RHS terms. This recommendation set will be referred to as the category-based
recommendations. In what follows, we describe in detail how SEWeP implements the
aforementioned process.
Content Characterization. This module takes as input the content of the web site
as well as a domain-specific ontology, and outputs the semantically annotated
content to the modules that are responsible for creating the C-Logs and the
semantic document clusters. The content characterization process consists of the
keyword extraction, keyword translation and semantic characterization sub-processes,
which are described in more detail in Section 3.4.
C-Logs Creation & Mining. This module takes as input the web site's logs as well
as the semantically annotated web site content. It outputs the semantically
enhanced C-logs (concept logs), which are in turn used to generate both URI- and
category-based frequent itemsets and association rules. These rules are
subsequently matched to the current user's visit by the recommendation engine.
We overview this process in Section 3.6.
Recommendation Engine. This module takes as input the current user's path and
matches it with the semantically annotated navigational patterns generated in the
previous phases. The recommendation engine generates three different
recommendation sets, namely original, semantic and category-based ones,
depending on the input patterns used. In Section 3.7 we overview the two novel
recommendation algorithms that are employed by SEWeP.
The creation of the ontology, as well as the semantic similarity measures used as input
in the aforementioned web personalization process, are orthogonal to the proposed
framework. We assume that the ontology is descriptive of the web site's domain and is
provided/created by a domain expert. In what follows we describe the key components of
our architecture, starting by introducing the similarity measures we used in our work.
organized terms [EM+06, MT+05]. The definitions of the two similarity measures are
given in what follows.
3.3.1 Wu&Palmer Similarity Measure
Given a tree and two nodes a, b of this tree, their similarity is computed as follows:

WPsim(a, b) = 2 · depth(c) / (depth(a) + depth(b))    (1)

where the node c is their deepest (in terms of tree depth) common ancestor.
3.3.2 THESUS Similarity Measure
Given a concept hierarchy O and two sets of weighted terms A = {(wi, ki)} and B = {(vi, hi)},
with ki, hi ∈ O, their similarity is defined as:

THEsim(A, B) = (1/2) · [ (1/K) · Σ_{i=1..|A|} max_{j∈[1,|B|]} ( λi,j · WPsim(ki, hj) )
             + (1/H) · Σ_{i=1..|B|} max_{j∈[1,|A|]} ( λi,j · WPsim(hi, kj) ) ]    (2)

where λi,j = (wi + vj) / (2 · max(wi, vj)), K = Σ_{i=1..|A|} λi,x(i) and
H = Σ_{i=1..|B|} λi,x(i), with x(i) the index j that attains the maximum in the
corresponding term.
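As an illustration of the two measures, the following sketch computes WPsim and THEsim over a toy concept hierarchy (the tree, terms and weights are invented for the example; this is not the SEWeP implementation):

```python
# Toy concept hierarchy: child -> parent ("root" is the top node).
parent = {"sports": "root", "winter_sports": "sports", "ski": "winter_sports",
          "travel": "root", "hotels": "travel"}

def depth(n):
    d = 1
    while n != "root":
        n = parent[n]
        d += 1
    return d

def ancestors(n):
    out = [n]
    while n != "root":
        n = parent[n]
        out.append(n)
    return out

def wpsim(a, b):
    # Eq. (1): deepest common ancestor c, then 2*depth(c)/(depth(a)+depth(b)).
    common = [x for x in ancestors(a) if x in ancestors(b)]
    c = max(common, key=depth)
    return 2 * depth(c) / (depth(a) + depth(b))

def thesim(A, B):
    # Eq. (2): symmetric average of best-match similarities, each match
    # damped by lambda = (w + v) / (2 * max(w, v)); A, B: [(weight, term)].
    def side(X, Y):
        total, norm = 0.0, 0.0
        for w, k in X:
            best_val, best_lam = 0.0, 0.0
            for v, h in Y:
                lam = (w + v) / (2 * max(w, v))
                val = lam * wpsim(k, h)
                if val > best_val:
                    best_val, best_lam = val, lam
            total += best_val
            norm += best_lam
        return total / norm if norm else 0.0
    return 0.5 * (side(A, B) + side(B, A))

print(wpsim("ski", "hotels"))                       # low: only "root" shared
print(thesim([(1.0, "ski")], [(1.0, "ski")]))       # identical sets -> 1.0
```

Both values stay in [0, 1], since each best match is normalized by the λ weights that selected it.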
Tf*Idf family [SB98]. Raw term frequency is based on the term statistics within a
document and is the simplest way of assigning weights to terms. Tf*Idf is a method
intended for collections of documents with similar content. In the case of a web site,
however, this assumption does not always hold, since a web site may contain
documents that refer to different thematic categories (especially in the case of web
portals); this is the reason we chose raw term frequency as the term weighting
method of our approach.
At the end of this phase, each document d is characterized by a weighted set of
keywords d = {(ki, wi)}, where wi is the weight representing the summed (over the
combination of methods) frequency of keyword ki. Before proceeding with
mapping the extracted keywords to related ontology terms, all non-English keywords
must be translated. In our approach, we determine the most suitable synonym using a
context-sensitive automated translation method, which is described in detail in the
Section that follows.
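The weighting step can be sketched as follows (a minimal illustration, not the thesis code; the page and anchor texts and the top-k cutoff are invented for the example):

```python
from collections import Counter

# Raw term frequencies are computed per source (page text, inlink and
# outlink anchor text) and summed into a single keyword weight.
def tf(text):
    return Counter(w.lower() for w in text.split())

def keyword_weights(page_text, inlink_text, outlink_text, k=5):
    combined = tf(page_text) + tf(inlink_text) + tf(outlink_text)
    return combined.most_common(k)  # [(keyword, summed frequency), ...]

print(keyword_weights("ski boots sale ski", "ski equipment", "winter ski", k=2))
```

The top-weighted keywords (here at most k per page) are the ones later mapped to ontology terms.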
3.4.2 Keyword Translation
As already mentioned, the recommendation process is based on the characterization
of all web documents using a common representation. Since many web sites contain
content written in more than one language, this raises the issue of mapping keywords
from different languages to the terms of a common domain ontology.
Consider, for example, the web site of a Computer Science department or of a
research group in Greece. This site will contain information addressed to the students,
which will be written in Greek, research papers, which will be written in English, and
course material, which will be written in both languages. Since the outcome of the
keyword extraction process is a mixed set of English and Greek words, all Greek
keywords must be translated to English prior to selecting the most frequent
ones. Using any dictionary, each Greek word (after stemming and transformation to the
nominative) will be mapped to a set of English synonyms; the most appropriate synonym,
however, depends on the context of the web page's content. A naive approach would be
to keep all possible translations, or a subset of them, but this would result in a high
number of keywords and would lead to inaccurate results. Another, less computationally
intensive, approach would be to keep the first translation returned by the dictionary,
which is the most common one. The first translation, however, is not always the best.
For example, the words plan, schedule and program are some of the translations of
the same Greek word; in the Informatics context, however, the word program is the one
that should be selected.
To address this important issue, we propose to determine the most precise synonym
based on the content of the web page it was extracted from. Assuming that the set of
keywords is descriptive of the web page's content, we derive the best synonym set
by comparing their semantics. This context-sensitive automated translation method is
applicable to any language, provided that a dictionary and the language's inflection rules
are available. In our system implementation we applied it to the Greek language.
Since all words in the Greek language (nouns, verbs, adverbs) can be inflected, we
perform stemming and transformation to the nominative of each Greek word prior to
applying the actual translation method. For this purpose, we used the inflection rules of
Triantafillidis' Grammar [Tria]. The translation algorithm is depicted in Figure 3. The
input is the set of English and Greek keywords (En(D) and Gr(D) respectively) of
document D. The output is a set of English keywords K that best characterize the web
page. Let En(g) = {English translations of g, g ∈ Gr(D)} and Sn(g) = {WordNet senses of
keywords in En(g)}. For every translated word's sense (as defined by WordNet), the
algorithm computes the sum of the maximum similarities between this sense and the senses
of the remaining keywords (let WPsim denote the Wu&Palmer similarity between two
senses). Finally, it selects the English translation with the maximum-scored sense. The
algorithm has complexity O(kn²) for every Greek keyword, where n is the number of
senses of a keyword and k is the number of remaining words w. Since this algorithm
is applied off-line, once for every document D, it does not constitute a bottleneck in the
system's online performance.
An initial experimental evaluation [LE+04] has shown promising results for the
proposed approach, but several issues remain open. For instance, our technique makes an
implicit assumption of one sense per discourse, i.e., that multiple appearances of the
same word will have the same meaning within a document. This assumption might not
hold in several cases, thus leading to erroneous translations. Our technique constitutes a
first step toward the automated mapping of keywords to the terms of a common concept
hierarchy; clearly, a more extensive study is required in order to provide a complete and
more precise solution.
Procedure translateW(Gr, En)
1. K ← ∅;
2. for all g ∈ Gr(D) do
3.    for all s ∈ Sn(g) do
4.       score[s] = 0;
5.       for all w ∈ (Gr(D) ∪ En(D)) − {g} do
6.          sim = max{WPsim(s, s′) : s′ ∈ Sn(w)};
7.          score[s] += sim;
8.       done
9.    done
10.   smax = maxarg_{s ∈ Sn(g)} score[s];
11.   K ← K ∪ {e ∈ En(g) : smax ∈ Sn(e)};
12. done
13. return K;
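The sense-scoring idea of the procedure above can be sketched as follows (a simplified, self-contained illustration, not the thesis implementation; the toy sense inventory and the 0/1 stand-in for WPsim are invented for the example):

```python
# Pick the candidate translation whose best sense scores highest against
# the senses of the remaining (context) keywords, as in lines 3-10 above.
def best_translation(candidates, context_words, senses, sim):
    best, best_score = None, float("-inf")
    for cand in candidates:
        for s in senses(cand):
            # sum, over context words, of the best-matching sense similarity
            score = sum(max((sim(s, s2) for s2 in senses(w)), default=0.0)
                        for w in context_words)
            if score > best_score:
                best, best_score = cand, score
    return best

# Toy sense inventory (stand-in for WordNet): word -> set of sense ids.
SENSES = {"plan": {"design"}, "schedule": {"timetable"},
          "program": {"software", "broadcast"},
          "computer": {"software"}, "code": {"software"}}
exact = lambda a, b: 1.0 if a == b else 0.0  # stand-in for WPsim

print(best_translation(["plan", "schedule", "program"],
                       ["computer", "code"],
                       lambda w: SENSES.get(w, set()), exact))  # -> program
```

With a real WordNet backend, `senses` and `sim` would be the synset lookup and the Wu&Palmer measure.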
provided by the thesaurus. Since the keywords carry weights according to their
frequency, the categories' weights are also updated.
We should stress here that the selection of the ontology influences the outcome of the
mapping process. For this purpose, it should be semantically relevant to the content to be
processed. In order to find the closest term in the ontology O for a keyword k that
describes a document, we compute the Wu & Palmer similarity [WP94] between all
senses Sn(k) of k and all senses Sn(ci) of each category ci in O. At the end of this
process, each keyword is mapped to every category with a respective similarity s. We
select the (k, c) pair that gives the maximum similarity s. This process is shown in Figure 4.
Procedure CategoryMapping(k, O)
1. for all sns ∈ Sn(k) do
2.    for all ci ∈ O do
3.       scsimmax = maxarg_{sc ∈ Sn(ci)} (WPsim(sns, sc));
4.    done
5.    ssimmax = max({scsimmax});
6.    cmax = c ∈ O, for which (scsimmax == ssimmax);
7. done
8. sim = max({ssimmax});
9. cat = c ∈ {cmax}, for which (ssimmax == sim);
10. return (cat, sim);
ri = ( Σ_{kj ∈ ci} wj · sj ) / ( Σ_{kj ∈ ci} wj )    (3)
where wj is the weight assigned to keyword kj for document d, and sj the similarity with
which kj is mapped to ci. At the end of this process, each document d is represented as a
set d = {(ci, ri)}, where ri ∈ [0,1] since sj ∈ [0,1]. Even though the aforementioned process
has complexity O(|c|·n²), where n is the number of senses of a word, it doesn't aggravate
the system's performance, since it is performed offline, once, and is repeated only when
the content of a web page changes or when new web pages are added to the web site.
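Equation (3) amounts to a similarity-weighted average over the keywords mapped to a category, as the following sketch illustrates (the keyword weights and similarities are invented for the example):

```python
# Relevance of a document to category c_i, per Eq. (3): the keyword
# weights w_j act as votes, each damped by its mapping similarity s_j.
def category_relevance(mapped):
    """mapped: list of (w_j, s_j) for the keywords k_j mapped to c_i."""
    num = sum(w * s for w, s in mapped)
    den = sum(w for w, s in mapped)
    return num / den if den else 0.0

# two keywords mapped to one category: weights 3 and 1, similarities 0.9 and 0.5
print(round(category_relevance([(3, 0.9), (1, 0.5)]), 2))  # -> 0.8
```

Since every sj lies in [0,1], the weighted average ri also lies in [0,1], matching the text above.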
Original recommendations are the straightforward way of generating
recommendations, relying simply on the usage data of a web site. They are generated
when, for each incoming user, a sliding window of her past n visits is matched to the
URI-based association rules in the database, and the m most similar ones are selected.
The system recommends the URIs included in the rules, but not visited by the user so far.
The intuition behind semantic recommendations is that useful knowledge,
semantically similar to the knowledge originally proposed to the users, is omitted for several
reasons (updated content, not enough usage data, etc.). These recommendations are in the
same format as the original ones, but the web personalization process is enhanced by
taking into account the semantic proximity of the content. In this way, the system's
suggestions are enriched with content bearing similar semantics. In short, they are
generated when, for each incoming user, a sliding window of their past n visits is
matched to the URI-based association rules in the database, and the single most similar
one is selected. The system finds the URIs included in the rule but not yet visited by the
user (let A) and recommends the m most similar documents that are in the same semantic
cluster as A.
Finally, the intuition behind category-based recommendations is the same as that
of semantic recommendations: incorporate content and usage data in the recommendation
process. This notion, however, is further expanded by expressing the user's navigational
behavior in a more abstract, yet semantically meaningful, way. Both the navigational
patterns' knowledge database and the current user's profile are expressed in terms of
categories. Therefore, matching patterns to the current user's navigational behavior is no
longer exact, since it utilizes the semantic relationships between the categories, as
expressed by their topology in the domain-specific concept hierarchy. The final set of
recommendations is generated when, for each incoming user, a sliding window of the
user's past n visits is matched to the category-based association rules in the database, and
the most similar one is selected. The system finds the most relevant document cluster
(using the similarity between category terms) and recommends the documents that have
not yet been visited by the user.
In what follows, we describe in detail the semantic and category-based
recommendation algorithms. The description of the generation of original
recommendations is omitted, since it is a straightforward application of the Apriori
algorithm to the sessionized web logs.
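That application can be sketched as follows: a minimal Apriori-style frequent-itemset miner over sessions, each session treated as one transaction (an illustrative sketch, not the SEWeP implementation; the session contents and the support threshold are invented for the example):

```python
from itertools import combinations

def frequent_itemsets(sessions, min_support):
    """Return {itemset: support} for all itemsets above min_support."""
    n = len(sessions)
    txns = [frozenset(s) for s in sessions]
    items = {i for t in txns for i in t}
    result = {}
    level = [frozenset([i]) for i in items]  # candidate 1-itemsets
    while level:
        # count support of each candidate and keep the frequent ones
        kept = {}
        for cand in level:
            sup = sum(1 for t in txns if cand <= t) / n
            if sup >= min_support:
                kept[cand] = sup
        result.update(kept)
        # join step: frequent k-itemsets -> candidate (k+1)-itemsets
        level = list({a | b for a, b in combinations(list(kept), 2)
                      if len(a | b) == len(a) + 1})
    return result

sessions = [["ski.html", "resorts.html", "boots.html"],
            ["ski.html", "resorts.html"],
            ["boots.html"]]
fs = frequent_itemsets(sessions, min_support=0.6)
print(sorted(tuple(sorted(s)) for s in fs))
```

Association rules then follow by splitting each frequent itemset into an LHS/RHS pair and filtering on confidence.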
3.7.1 Semantic Recommendations
Navigational patterns. We use the Apriori algorithm [AS94] to discover frequent
itemsets and/or association rules from the C-Logs, CLg. We consider each distinct
user session to represent a different transaction. We will use S = {Im} to denote the final
set of frequent itemsets/association rules, where Im = {(urii)}, urii ∈ CLg.
Recommendations. In brief, the recommendation method takes as input the user's
current visit, expressed as a set of URIs: CV = {(urij)}, urij ∈ WS (WS is the set of the
web site's URIs; note that some of these may not be included in CLg). The method finds
the itemset in S that is most similar to CV, and recommends the documents (labeled by
related categories) belonging to the most similar document cluster Clm ∈ Cl (Cl is the set
of document clusters). In order to find the similarity between URIs, we perform binary
matching (denoted as SIM). This procedure is shown in Figure 5.
Procedure SemanticRec(CV)
1. CM ← ∅;
2. Im = maxarg_{I ∈ S} SIM(I, CV);
3. for all d ∈ Im do
4.    for all (cj, rj) ∈ d do
5.       if cj ∈ CM then
6.          rj += r, where (cj, r) is the entry already in CM;
7.          CM ← (cj, rj);
8.       else
9.          CM ← (cj, rj);
10.   done
11. done
12. return D = {d}, d ∈ Clm, where Clm = maxarg_{Cln ∈ Cl} WPsim(Cln, CM);
use the THESUS metric (denoted as WPsim and THEsim respectively), defined in Section
3. This procedure is shown in Figure 6.
Procedure CategoryRec(CV)
1. Ik = maxarg_{I ∈ S} THEsim(I, CV);
2. for all cj ∈ CV do
3.    ci = maxarg_{c ∈ Ik} WPsim(c, cj);
4.    cn = least_common_ancestor(ci, cj); rn = max(ri, rj);
5.    CI ← (cn, rn);
6. done
7. return D = {d}, d ∈ Cln, where Cln = maxarg_{Cl′ ∈ Cl} WPsim(Cl′, CI);
model, incorporating all three types of recommendations, generates the most effective
results.
3.8.1 Methodology
Data Set. We used the web logs of the DB-NET web site [DBN]. This is the site of a
research team, which hosts various academic pages, such as course information, research
publications, as well as members' home pages. The two key advantages of using this data
set are that the web site contains web pages in several formats (such as pdf, html, ppt,
doc, etc.), written both in Greek and in English, and that a domain-specific concept
hierarchy is available (the web administrator created a concept hierarchy of 150
categories that describe the site's content). On the other hand, its context is rather narrow,
as opposed to web portals, and its visitors are divided into two main groups: students and
researchers. Therefore, the subsequent analysis (e.g. association rules) uncovers these
trends: visits to course material, or visits to publications and researcher details. It is
essential to point out that the need for processing online (up-to-date) content made it
impossible for us to use other publicly available web log sets, since all of them were
collected many years ago and the relevant sites' content is no longer available. Moreover,
the web logs of popular web sites or portals, which would be ideal for our experiments,
are considered to be personal data and are not disclosed by their owners. To overcome
these problems, we collected web logs over a 1-year period (1/11/02 – 31/10/03). After
preprocessing, the total web log size was approximately 10^5 hits, including a set of over
67,500 distinct anonymous user sessions on a total of 357 web pages. The sessionizing
was performed using distinct-IP and time-limit considerations (setting 20 minutes as the
maximum time between consecutive hits from the same user).
Keyword Extraction & Category Mapping. We extracted up to 7 keywords from each
web page using a combination of all three methods (raw term frequency, inlinks,
outlinks). We then mapped these keywords to ontology categories and kept at most 5 for
each page.
Document Clustering. We used the clustering scheme described in [HN+03], i.e. the
DBSCAN clustering algorithm and the THEsim similarity measure for sets of keywords.
However, other web document clustering schemes (algorithm & similarity measure) may
be employed as well.
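Such a scheme can be sketched with scikit-learn's DBSCAN over a precomputed distance matrix (distance = 1 − similarity); the toy similarity matrix below is invented and merely stands in for a THEsim-based matrix:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Pairwise document similarities (toy values): docs 0 and 1 are close,
# doc 2 is dissimilar to both.
sim = np.array([[1.0, 0.9, 0.1],
                [0.9, 1.0, 0.2],
                [0.1, 0.2, 1.0]])
dist = 1.0 - sim  # DBSCAN expects distances when metric="precomputed"

labels = DBSCAN(eps=0.3, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)  # docs 0 and 1 form one cluster; doc 2 is noise (label -1)
```

Any similarity measure over keyword sets can be plugged in this way, since only the precomputed matrix is passed to the algorithm.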
Association Rules Mining. We created both URI-based and category-based frequent
itemsets and association rules. We subsequently used those exceeding a 40% confidence
threshold.
3.8.2 Experimental Results
In our experiments, we chose three popular paths followed by users in the past, each
one having a different objective: one (A) containing visits to contextually irrelevant
pages (random surfer), a second (B) comprising a short path to very specialized pages
(information-seeking visitor), and a third (C) including visits to top-level, yet
research-oriented, pages (topic-oriented visitor). We then conducted a series of 4
experiments. These paths, along with the recommendations generated for Experiment #2,
are included in Appendix A.
Experiment #1. For the first experiment, we created three different sets of
recommendations, named Original, Semantic, and Category (the sets are named after the
respective recommendation methods). We presented the users with the paths and the
three sets (unlabeled) in random order, and asked them to rate each set as indifferent,
useful, or very useful. The outcome is shown in Figure 7.
The results of the first experiment revealed that, depending on the context and
purpose of the visit, the users profit from different sources of recommendations. More
specifically, in visit A, both the Semantic and the Category sets are mostly evaluated as
useful/very useful. The Category recommendation set performs better, and this can be
explained by the fact that it is the one that recommends 3 hub pages, which seems to be
the best choice after a random walk on the site. On the other hand, in visits B and C,
Semantic performs better. In visit B, the path was focused on specific pages and the same
held for the recommendation preferences. In visit C, the recommendations that were
more relevant to the topics previously visited were preferred.
[Figure 7: user ratings (Very Useful / Useful / Indifferent) of the Original, Semantic and Category recommendation sets for Paths A, B and C.]
For that reason, we decided to evaluate the performance of a hybrid method that
incorporates all three types of recommendations. We ran a set of experiments comparing,
for each path, each one of the proposed recommendation sets with a Hybrid
recommendation set containing the top recommended URIs from each of the three
methods (Original, Semantic, Category). We then asked the users to choose the
recommendation set they preferred.
Experiment #2: In the second experiment, we asked the users to choose between the
Hybrid and the Original recommendations set. The outcome is shown in Figure 8.
[Figure 8: Original vs. Hybrid recommendations; user preferences for Paths A, B and C.]
specific interests, even though the Semantic recommendation set seems to prevail in the
case of specialized information seeking visits.
[Figure: Semantic vs. Hybrid recommendations; user preferences for Paths A, B and C.]
[Figure: Category-based vs. Hybrid recommendations; user preferences for Paths A, B and C.]
the recommendation engine; JDBC Library for MS SQL Server. The main functionalities
of the prototype are described below:
Logs Preprocessing: The system provides full functionality for preprocessing any
kind of web logs, by enabling the definition of new log file templates, filters
(including/excluding records based on field characteristics), etc. The clean logs are
stored in new files. A screenshot of the log preprocessing module is shown in Figure 11.
Content Retrieval: The system crawls the web and downloads the web site's pages,
extracting the plain text from a variety of crawled file formats (html, doc, php, ppt, pdf,
flash, etc.) and storing it in appropriate database tables.
Keyword Extraction & Translation: The user selects among different methods for
extracting keywords. Prior to the final keyword selection, all non-English keywords are
translated using an automated process (the system currently also supports Greek content).
All extracted keywords are stored in a database table along with their respective frequency.
Keyword-Category Mapping: The extracted keywords are mapped to categories of a
domain-specific ontology. The system finds the closest category to each keyword
through the mechanisms provided by a thesaurus (WordNet [WN]). The weighted
categories are stored in XML files and/or in a database table.
Session Management: SEWeP enables anonymous sessionizing based on distinct IPs
and a user-defined time limit between sessions. The distinct sessions are stored in XML
files and/or database tables. Figure 12 includes a screenshot of this module.
Semantic Association Rules Mining: SEWeP provides a version of the Apriori
algorithm [AS94] for extracting frequent itemsets and/or association rules (confidence
and support thresholds set by the user). Apart from URI-based rules, the system also
provides functionality for generating category-based rules. The results are stored in text
files for further analysis or use by the recommendation engine. Figure 13 includes a
screenshot of this module.
Clustering: SEWeP integrates clustering facilities for organizing the documents into
meaningful semantic clusters. Currently SEWeP capitalizes on the clustering tools
available in the THESUS system [VV+04].
Figure 13. SEWeP screenshot: the Semantic Association Rules Mining module
creating or importing taxonomies as well as the support for administrative functions such
as workflow and user management.
The Web Mining Layer is based on the C-logs creation & mining components of the
SEWeP architecture, enabling the semantic enhancement of web logs in order to create
the C-logs, which are in turn used as input to the Web Mining Module.
The Knowledge Management Layer is responsible for managing the knowledge
generated by the Web Mining layer and includes its deployment through various
recommendation engines (Recommendation Module).
Apart from these three general layers, there is also an Interaction Layer, which
includes the Publishing Module and the web server, and which presents the corresponding
personalized page to every user by combining the possibly fixed parts of the web page
with the parts where the personalized information is presented. More details on the
IKUM project can be found in [VEA03, EVA05].
3.9.3 The Greek Web Archiving Project
The objective of this project is to propose a framework for archiving the Greek Web.
This process involves the creation of an archive containing as many Greek web pages
as possible, as well as the extraction of knowledge from this collection. What should be
characterized as Greek Web is not clear-cut, since there exist many Greek web sites that
are not under the .gr top-level domain. Therefore, the main criteria we use in order to
define the Greek perimeter, apart from the domain name, are the Greek language and
Hellenic-oriented content. In addition to collecting the data, we also perform a
semantic characterization of the pages in order to group them into thematic clusters.
These clusters can subsequently be used to accelerate the search in the Web Archive and
to enable keyword-based search without human intervention.
The Greek Web Archiving system architecture is depicted in Figure 15. The system consists of three main components: the Web Crawler, the Content Manager and the Clustering Module. The Web Crawler searches the web using the aforementioned criteria in order to gather as many Greek web pages as possible. The collected URIs are stored in a database along with the date and time the crawling was performed, to enable future updates of the archive. Additional information, such as the web pages that point to, or are pointed to by, each URI can also be included for future use. The Content Manager is in essence the Content Characterization component of the SEWeP architecture. Finally, the Clustering Module uses the K-means or the DBSCAN algorithm and generates a label for each created cluster, taking the cluster centroids into account. The system also integrates a Cluster Validation sub-module in order to evaluate the quality of the created clusters. Since the structure of the data was not known a priori, we used relative cluster validation criteria [HBV02], including the Dunn index, the modified Hubert statistic and the Davies-Bouldin index, for this purpose. More information on this project can be found in [LE+04].
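To illustrate the kind of relative validation criterion the Cluster Validation sub-module applies, the Davies-Bouldin index can be computed as follows. This is a minimal sketch on toy 2-D points; the point values and cluster assignments are hypothetical, not taken from the project's data:

```python
from math import dist  # Python 3.8+

def davies_bouldin(clusters):
    """Davies-Bouldin index: lower values indicate more compact,
    better-separated clusters."""
    centroids = [tuple(sum(c) / len(pts) for c in zip(*pts)) for pts in clusters]
    # average distance of each cluster's points to its centroid (scatter)
    scatter = [sum(dist(p, c) for p in pts) / len(pts)
               for pts, c in zip(clusters, centroids)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
                     for j in range(k) if j != i)
    return total / k

# two well-separated toy clusters yield a small index
clusters = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
print(davies_bouldin(clusters))
```

In practice one would compute the index for several clusterings (e.g. different values of K in K-means) and keep the one minimizing it.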
3.10 Conclusions
In this Chapter we presented the key concepts and algorithms underlying SEWeP, a novel semantic web personalization system. SEWeP is based on the integration of content semantics with the users' navigational behaviour in order to generate recommendations. The web site's documents are automatically mapped to ontology terms, enabling further processing (clustering, association rules mining, recommendations generation) to be performed based on the semantic similarity between these terms. Using this representation, the final recommendation set presented to the user is semantically enhanced, overcoming problems that emerge when pure usage-based personalization is performed. Experimental results with real users have verified our claim that the semantic enrichment of the personalization process improves the quality of the recommendations in terms of complying with the users' needs. Nevertheless, the usefulness of recommendations is a very subjective issue and therefore very difficult to evaluate. A general observation is that, of the three recommendation sets the system generates, the Semantic recommendation set, produced after semantic expansion of the most popular association rule, performs best; comparing all three with the Hybrid recommendation set, however, we conclude that the Hybrid setup is the most useful overall.
CHAPTER 4
Link Analysis for Web Personalization
The connectivity features of the web graph play an important role in the process of web searching and navigating. Several link analysis techniques, based on the popular PageRank algorithm [BP98], have been largely used in the context of web search engines. The underlying intuition of these techniques is that the importance of each page in a web graph is defined by the number and the importance of the pages linking to it.
In this thesis, we introduce link analysis in a new context, that of web personalization. Motivated by the fact that, in the context of navigating a web site, a page/path is important if many users have visited it before, we propose a new algorithm, UPR (Usage-based PageRank). UPR is based on a personalized version of PageRank, favoring pages and paths previously visited by many web site users. We apply UPR to a representation of the web site's user sessions, termed Navigational Graph, in order to rank the web site's pages. This ranking may then be used in several contexts:
1) Use it as a global ranking of the web site's pages. The computed rank probabilities can serve as the prior probabilities of the pages when recommendations are generated using probabilistic predictive models such as Markov Chains, higher-order Markov models, tree synopses etc.
2) Apply UPR to small subsets of the web site's navigational graph (or its approximations), which are generated based on each current user's visit. This
This is the intuition behind the hybrid algorithm we propose in this thesis, UPR. UPR is in essence a usage-based variation of PageRank which is applied to the web site's navigational graph, and provides us with authority scores that represent the importance of the web site's pages both in terms of link connectivity and in terms of visit frequency.
In what follows we present some preliminaries concerning the navigational graph in
Section 4.2. Section 4.3 includes a detailed analysis of both the PageRank and UPR
algorithms. The two proposed personalization frameworks in which UPR can be applied
are presented in Sections 4.4 and 4.5. The experimental study is detailed in Section 4.6
whereas an overview of the prototype system implementing the proposed frameworks is
included in Section 4.7.
4.2 Preliminaries
The input to our proposed algorithm is the Navigational Graph (NG). NG is a weighted directed graph representation of the user sessions. NG can be used in order to discover page and path probabilities and support popular path prediction, since it contains all the distinct user sessions and is therefore a full representation of the actual user paths followed in the past. This structure, however, can become large, especially when modeling the user sessions of big web sites. Therefore, the processing of NG may become computationally very intensive. The need for reduced complexity and online availability imposes the creation of approximations of the NG, referred to as NG synopses. An NG synopsis may be a Markov model of any order (depending on the simplicity/accuracy trade-off that is required), or any other graph synopsis, such as those proposed in [PG02, PGI04]. We should stress at this point that our approach is orthogonal to the type of synopsis one may choose. In what follows we present in more detail the NG structure and its synopses, with emphasis on Markov models, since these are the NG synopses we use both in the second framework we propose in this thesis and in the experimental study we performed.
4.2.1 The Navigational Graph
As already mentioned, the Navigational Graph (NG) is a weighted directed graph which represents the user sessions of a web site. In its simplest form, NG is a node- and edge-labeled tree that has as root a special node R, and the labels of the nodes identify the M web pages of the web site WS. Another option would be to encode the data as a graph using a bisimulation of the tree-based representation. We stress that this choice is orthogonal to the techniques that we introduce. The edges of NG represent the links between the web pages (i.e. the paths followed by the users), and the labels (weights) on the edges represent the number of link traversals. The weighted paths from the root towards the leaves represent all the user session paths that are included in the web logs. All tree paths terminate in a special leaf-node E denoting the end of a path. The NG resembles the web site's graph; it may, however, include page links that do not physically exist (if, for example, a user jumps from one page to another following a bookmark), or, on the other hand, may not include existing hyperlinks, if they were never followed in the past. Since NG is a complete representation of the information residing in the web logs, there is a high degree of replication of states in different parts of this structure.
The NG creation algorithm is as follows: for every user session US in the web logs, we create a path starting from the root of the tree. If a subsequence of the session already exists, we update the weights of the respective edges; otherwise we create a new branch, starting from the last visited common page in the path. We note that any consecutive page repetitions have been removed from the user sessions during the data cleaning process; on the other hand, we keep any pages that have been visited more than once, but not consecutively. We also denote the end of a session using a special exit node. The algorithm for creating the NG is detailed in Figure 18.
Procedure CreateTree(U)
Input: user sessions U
Output: navigational tree NG

tmpP <- root(NG);
for every US ∈ U do
    while US ≠ ∅ do
        si <- first_state(US);
        if parent(tmpP, si) then
            w(tmpP, si) <- w(tmpP, si) + 1;
        else
            addchild(tmpP, si);
            w(tmpP, si) <- 1;
        endif
        tmpP <- si;
        US <- remove(US, si);
    done
    if parent(tmpP, E) then
        w(tmpP, E) <- w(tmpP, E) + 1;
    else
        addchild(tmpP, E);
        w(tmpP, E) <- 1;
    endif
    tmpP <- root(NG);
done
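As a concrete illustration, the tree-building procedure of Figure 18 can be sketched in Python with the tree as nested dictionaries. This is a minimal sketch, not the actual implementation; the sample sessions are those of the table below, and "E" denotes the special exit node:

```python
def build_ng(sessions):
    """Build the navigational tree: each edge carries the number of
    sessions that traversed it; 'E' marks the end of a session."""
    root = {}
    for us in sessions:
        node = root
        for page in list(us) + ["E"]:
            entry = node.setdefault(page, {"w": 0, "children": {}})
            entry["w"] += 1          # weight of the edge leading to `page`
            node = entry["children"]
    return root

ng = build_ng(["abcd", "abed", "acdf", "bcbg", "bcfa"])
print(ng["a"]["w"])                   # -> 3, sessions starting at page a
print(ng["a"]["children"]["b"]["w"])  # -> 2, sessions following a->b
```

Sessions sharing a prefix share a branch, so the weight of each edge counts how many sessions traversed that link, exactly as in the pseudocode.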
Table 3. Sample user sessions

Path  Pages
1     a b c d
2     a b e d
3     a c d f
4     b c b g
5     b c f a
may be represented as a stochastic process {X_n} that has S as state space. If $P_{i,j}^{(m)}$ is the conditional probability of visiting page j in the next step, based on the last m pages visited, then {X_n} is called an mth-order Markov model [Kij97].
The simplest synopsis of NG is a Markov Chain. The Markov Chain is built upon the Markov Property, which states that each next visit to a page depends only on the current one and is independent of the previous ones. Therefore, a Markov Chain is a 1st-order Markov model, and the conditional probability of visiting page x_j in the next step is given by Equation 4:

$P_{i,j}^{(1)} = P(X_{n+1} = x_j \mid X_n = x_i)$    (4)

For an mth-order Markov model, this generalizes to Equation 5:

$P_{i,j}^{(m)} = P(X_{n+1} = x_j \mid X_n = x_i, X_{n-1} = x_{i_{n-1}}, \ldots, X_0 = x_{i_0}) = P(X_{n+1} = x_j \mid X_n = x_i, \ldots, X_{n-m+1} = x_{i_{n-m+1}})$    (5)

where the conditional probability of {X_{n+1}}, given all the previous events, equals the conditional probability of {X_{n+1}} given only the m previous events.
The transition probabilities are easily computed using the information residing in NG. We define the one-step transition probability matrix TP as follows: each item TP_{i,j} represents the probability of transitioning from page(s) x_i to page x_j in one step. In other words,

$TP_{i,j} = P(x_j \mid x_i) = \frac{w_{ij}}{w_i}$    (6)
where wi represents the total number of visits to page(s) xi, and wij represents the number
of consecutive visits from xi to xj. Note that in case of paths having length l>1, we denote
as xi the prefix containing the first l-1 pages.
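The counts w_i and w_ij, and from them the matrix TP of Equation 6, can be derived directly from the raw sessions. A minimal first-order sketch over the sample sessions of Table 3 (helper names are ours); note that, per the definition above, w_i counts all visits to a page, including those in the last position of a session:

```python
from collections import Counter

def transition_probs(sessions):
    """First-order transition probabilities TP[i,j] = w_ij / w_i (Equation 6)."""
    w_page = Counter()   # w_i : total visits to page i
    w_pair = Counter()   # w_ij: consecutive visits from i to j
    for us in sessions:
        w_page.update(us)
        w_pair.update(zip(us, us[1:]))
    return {(i, j): w / w_page[i] for (i, j), w in w_pair.items()}

tp = transition_probs(["abcd", "abed", "acdf", "bcbg", "bcfa"])
print(tp[("b", "c")])  # w_bc / w_b = 3/5
```

The resulting dictionary plays the role of TP; missing keys correspond to transitions never observed in the logs.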
Table 4. Path Frequencies

  l=1          l=2            l=3
  xi  wi       xixj  wij      xixjxk  wij
  a   4        ab    2        abc     1
  b   5        ac    1        abe     1
  c   4        bc    3        acd     1
  d   3        be    1        bcb     1
  e   1        bg    1        bcd     1
  f   2        cb    1        bcf     1
  g   1        cd    2        bed     1
               cf    1        cbg     1
               df    1        cdf     1
               ed    1        cfa     1
               fa    1
Table 4 includes the paths of length l ≤ 3 corresponding to the user sessions included in Table 3. Using this information, and based on the previous analysis, we can compute the transition probabilities for 1st- and 2nd-order Markov model NG synopses. The respective 1st-order Markov model (Markov Chain) synopsis is depicted in Figure 20. The numbers in parentheses in the nodes denote the number of visits to a page, whereas the edges' weights denote the number of times the respective link was followed. Nodes S and E represent the paths' start and end points respectively.
In the analysis that follows we use Markov models in two different frameworks. In
the first, we apply them in order to synopsize NG prior to applying the proposed localized
personalized ranking algorithm l-UPR. In the second, we propose Markov model-based
hybrid predictive models that incorporate link analysis techniques.
The PageRank algorithm is the most popular link analysis algorithm; it assigns numerical weightings to web documents, which are used by web search engines in order to rank the retrieved results. The algorithm models the behavior of a random surfer, who either chooses an outgoing link from the page he is currently visiting, or jumps to a random page. Each choice bears a probability. The PageRank of a page is defined as the probability of the random surfer visiting this page at some particular time step k > K. This probability is correlated with the importance of the page, as it is defined based on the number and the importance of the pages linking to it. For sufficiently large K this probability is unique, as illustrated in what follows.
Consider the web as a directed graph G, where the N nodes represent the web pages and the edges represent the links between them. A random walk on G induces a Markov Chain whose states are given by the nodes of G, and M is the stochastic transition matrix, with m_ij describing the one-step transition probability from page x_j to page x_i. The element m_ij is 0 if there is no direct link from x_j to x_i, and the non-zero elements are normalized such that, for each j:
$\sum_{i=1}^{N} m_{ij} = 1$    (7)

The PageRank vector $\vec{PR}^*$ then satisfies

$\vec{PR}^* = M \cdot \vec{PR}^*$    (8)

in other words, $\vec{PR}^*$ is the dominant eigenvector of the matrix M. Since M is the stochastic transition matrix over the web graph G, PageRank is in essence the stationary probability distribution over pages induced by a random walk on G. In practice, the random surfer model replaces M with the matrix

$M' = \epsilon M + (1-\epsilon) U$    (9)

In other words, the user may follow an outgoing link (with probability ε), or choose a random destination (usually referred to as a random jump) based on the probability distribution of U (with probability 1−ε). The latter process is also known as teleportation. PageRank can then be expressed as the unique solution to Equation 8, if we substitute M with M′:

$\vec{PR} = \epsilon M \cdot \vec{PR} + (1-\epsilon)\vec{p}$    (10)

where $\vec{p}$ is a non-negative N-vector whose elements sum to 1.
Usually $m_{ij} = \frac{1}{|Out(x_j)|}$ for the pages $x_i$ linked from $x_j$, and $U = \left[\frac{1}{N}\right]_{N \times N}$, i.e. the probability of teleporting to another page is uniform. In that case $\vec{p} = \left[\frac{1}{N}\right]_{N \times 1}$.
By choosing, however, U, and consequently $\vec{p}$, to follow a non-uniform distribution, we can bias the PageRank computation to favor certain pages (the random jump is then no longer random!). Thus, $\vec{p}$ is usually referred to as the personalization vector. This approach is largely used in the context of web search engines, where the ranking of the retrieved results is biased by favoring pages relevant to the query terms, or to the user's preferences for certain topic categories [ANM04, H02, RD02, WC+02]. In what follows, we present UPR, a usage-based personalized version of the PageRank algorithm, used for ranking the pages of a web site based on the navigational behavior of previous visitors.
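A compact sketch of the biased computation of Equation 10, under the simplifying assumption that every page has at least one out-link (the graph and the personalization vectors below are illustrative only):

```python
def personalized_pagerank(out_links, p, eps=0.85, iters=100):
    """Power iteration for PR = eps * M * PR + (1 - eps) * p,
    with uniform transition probabilities over out-links."""
    pr = dict(p)
    for _ in range(iters):
        nxt = {x: (1 - eps) * p[x] for x in p}
        for j, outs in out_links.items():
            share = eps * pr[j] / len(outs)   # rank mass propagated per out-link
            for i in outs:
                nxt[i] += share
        pr = nxt
    return pr

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
uniform = {x: 1 / 3 for x in "abc"}
biased = {"a": 0.1, "b": 0.8, "c": 0.1}   # personalization vector favoring b
print(personalized_pagerank(links, biased)["b"] >
      personalized_pagerank(links, uniform)["b"])  # True: biasing raises b's rank
```

Shifting teleportation mass towards a page monotonically increases its stationary rank, which is exactly the biasing effect described above.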
4.3.2 UPR: Link Analysis on the Navigational Graph
Based on the intuition that a page is important in a web site if many users have visited it before, we introduce the hybrid link analysis algorithm UPR. UPR extends the traditional link analysis algorithm PageRank by biasing the page ranking with knowledge acquired from previous user visits, as recorded in the user sessions. In order to achieve this, we define both the transition matrix M and the personalization vector $\vec{p}$ in such a way that the final ranking of the web site's pages is strongly related to the frequency of visits to them.
Recapitulating from Section 4.2, we define the directed navigational graph NG, where the nodes represent the web pages of the web site WS and the edges represent the consecutive one-step paths followed by previous users. Both nodes and edges carry weights. The weight w_i on each node represents the number of times page x_i was visited, and the weight w_ji on each edge represents the number of times x_i was visited immediately after x_j. We denote the set of pages pointed to by x_j (outlinks) as Out(x_j), and the set of pages pointing to x_j (inlinks) as In(x_j).
Following the aforementioned properties of Markov theory and the PageRank computation, the Usage-based PageRank vector $\vec{UPR}$ is the solution to the following equation:

$\vec{UPR} = \epsilon M \cdot \vec{UPR} + (1-\epsilon)\vec{p}$    (11)

The transition matrix M on NG is defined as the square N × N matrix whose elements m_ij equal 0 if there does not exist a link (i.e. visit) from page x_j to x_i, and

$m_{ij} = \frac{w_{ji}}{\sum_{x_k \in Out(x_j)} w_{jk}}$    (12)

otherwise. The personalization vector $\vec{p}$ is defined as

$\vec{p} = \left[\frac{w_i}{\sum_{x_j \in WS} w_j}\right]_{N \times 1}$    (13)

Consequently, each component of $\vec{UPR}$ is iteratively computed as:

$UPR_i^{(n)} = \epsilon \sum_{x_j \in In(x_i)} UPR_j^{(n-1)} \frac{w_{ji}}{\sum_{x_k \in Out(x_j)} w_{jk}} + (1-\epsilon)\frac{w_i}{\sum_{x_j \in WS} w_j}$    (14)
Each iteration of UPR has complexity O(n²). The total complexity is thus determined by the number of iterations, which in turn depends on the size of the dataset. In practice, however, PageRank (and accordingly UPR) gives good approximations after 50 iterations for ε = 0.85 (the most commonly used value, recommended in [BP98]). The computations can be accelerated by applying techniques such as those described in [KHG03, KH+03], even though this is not necessary in the proposed frameworks, since UPR is applied to a single web site and therefore converges after a few iterations.
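The iteration of Equation 14 can be sketched directly from the edge and node counts of NG. The toy counts below are illustrative, and every page is assumed to have at least one outgoing edge (in a full NG the exit node E guarantees this):

```python
def upr(edge_w, node_w, eps=0.85, iters=100):
    """Usage-based PageRank (Equation 14): both the transition matrix and
    the personalization vector are derived from visit frequencies."""
    total = sum(node_w.values())
    p = {x: w / total for x, w in node_w.items()}   # Equation 13
    out_w = {x: 0 for x in node_w}                  # sum of w_jk over Out(x_j)
    for (j, _), w in edge_w.items():
        out_w[j] += w
    rank = dict(p)
    for _ in range(iters):
        nxt = {x: (1 - eps) * p[x] for x in node_w}
        for (j, i), w in edge_w.items():            # m_ij as in Equation 12
            nxt[i] += eps * rank[j] * w / out_w[j]
        rank = nxt
    return rank

edges = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 3,
         ("c", "d"): 2, ("c", "a"): 1, ("d", "a"): 3}
visits = {"a": 4, "b": 5, "c": 4, "d": 3}
r = upr(edges, visits)
print(sum(r.values()))  # the ranks form a probability distribution
```

Since each column of the induced matrix sums to 1, the total rank mass is preserved across iterations, which is what makes the result interpretable as visit probabilities.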
In the Sections that follow, we present how UPR can be applied in different personalization frameworks in order to assist the recommendation process.
The prNG construction algorithm is presented in Figures 23 and 24. The algorithm's complexity depends on the synopsis used, since the choice of synopsis affects the time needed to locate the successive pages when expanding the current path. It also depends on the number of outgoing links of each sub-graph page and on the expansion depth d. Therefore, if the complexity of locating successive pages in a synopsis is k, the complexity of the prNG creation algorithm is O(k · fanout(NG)^(d-1)), where fanout(NG) is the maximum number of a node's outgoing links in NG. In the case of Markov model synopses, k = 1, since the process of locating the outgoing pages of a page or path reduces to a lookup in a hash table.
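For a first-order Markov synopsis, the expansion step reduces to hash-table lookups. The following is a hypothetical sketch (function and variable names are ours, not those of Figures 23-24); the successor table is derived from the sample sessions of Table 3:

```python
def expand_prng(next_pages, current_path, d=2):
    """Expand the current path to depth d, collecting candidate next pages.
    `next_pages` maps a page to its observed successors (1st-order synopsis);
    pages already visited by the user are excluded, as prNG requires."""
    visited = set(current_path)
    frontier = {current_path[-1]}
    candidates = set()
    for _ in range(d):
        frontier = {y for x in frontier
                    for y in next_pages.get(x, ())
                    if y not in visited}
        candidates |= frontier
    return candidates

successors = {"a": ["b", "c"], "b": ["c", "e", "g"],
              "c": ["b", "d", "f"], "d": ["f"], "e": ["d"], "f": ["a"]}
print(expand_prng(successors, ["a", "b"], d=2))
```

Each expansion level is one dictionary lookup per frontier page, matching the k = 1 cost stated above; the cost is therefore dominated by the fanout raised to the expansion depth.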
Since the resulting prNG includes all possible next page visits of the user, we then apply UPR in order to rank them and generate personalized recommendations. The personalized navigational sub-graph prNG should be built so as to retain the attributes required for UPR to converge. The irreducibility of the sub-graph is always satisfied, since we have added the damping factor (1−ε) in the rank propagation. Moreover, Equation 7, which states that the sum of all outgoing edges' weights of every node in the sub-graph equals 1, is satisfied since we normalize them. Note here that prNG does not include any previously visited pages.
Definition (l-UPR): We define l-UPR_i of a page x_i as the UPR rank value of this page in the personalized navigational sub-graph prNG.
The application of UPR or l-UPR to the navigational graph results in a ranked set of pages which are subsequently used for recommendations. As already presented, the final set of candidate recommendation pages can be either personalized or global, depending on the combination of algorithm and navigational graph chosen:
1) Apply l-UPR to prNG. Since prNG is a personalized fraction of the NG synopsis, this approach results in a personalized usage-based ranking of the pages most likely to be visited next, based on the current user's path.
2) Apply UPR to the NG synopsis. This approach results in a global usage-based ranking of all the web site's pages. This global ranking can be used as an alternative in case personalized ranking does not generate any recommendations. It can also be used for assigning page probabilities in the context of other probabilistic prediction frameworks, as described in the Section that follows.
Finally, another consideration would be to pre-compute a set of recommendations for all popular paths in the web site, in order to save time during the online computation of the final recommendation set.
site's pages and edges are the hyperlinks between them, and are in essence based on what we have already described as NG synopses. Using the transition probabilities between pages as defined by the probabilistic model, a path prediction is made by selecting the most probable path among candidate paths, based on each user's visit. Such purely usage-based probabilistic models, however, present certain shortcomings. Since the prediction of users' navigational behavior is based solely on the usage data, the structural properties of the web graph are ignored, and thus important paths may be underrated. Moreover, as we will also see in the experimental study we performed, such models are often shown to be vulnerable to the training data set used.
In this Section we present a hybrid probabilistic predictive model (h-PPM) that extends Markov models by incorporating link analysis methods. More specifically, we choose Markov models as NG synopses and use UPR, along with two more PageRank-style variations of it, for assigning prior probabilities to the web pages based on their importance in the web site's web and navigational graphs.
4.5.1 Popular Path Prediction

$P(x_1 \to x_2 \to \ldots \to x_k) = P(x_1) \prod_{i=2}^{k} P(x_i \mid x_{i-m} \ldots x_{i-1})$    (15)
For example, using a Markov Chain as the prediction model, the probability of the path {a→b→c} reduces to

$P(a \to b \to c) = P(a) P(b \mid a) P(c \mid b) = P(a) \frac{P(a \to b)}{P(a)} \frac{P(b \to c)}{P(b)}$
Based on Equation 15, the prediction of the next most probable page visit of a user is performed by computing the probabilities of all existing paths having the pages visited so far by the user as a prefix, and choosing the most probable one. The computation of the conditional probabilities is straightforward, since it reduces to a lookup in the transition probability matrix TP. The assignment of the prior probabilities, on the other hand, is an open issue, and we deal with it in the sequel.
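Under a first-order model, Equation 15 is one lookup per step. A minimal sketch; the prior below follows the "total visits" scheme discussed in the next subsection, with counts taken from Table 4:

```python
def path_prob(prior, tp, path):
    """P(x1 -> ... -> xk) = P(x1) * product of TP lookups (Equation 15, m = 1)."""
    prob = prior[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= tp.get((a, b), 0.0)   # unseen transition -> probability 0
    return prob

total = 4 + 5 + 4 + 3 + 1 + 2 + 1            # sum of the w_i column of Table 4
prior = {"a": 4 / total, "b": 5 / total}      # total-visits prior
tp = {("a", "b"): 2 / 4, ("b", "c"): 3 / 5}   # w_ij / w_i from Table 4
print(path_prob(prior, tp, "abc"))  # P(a) * P(b|a) * P(c|b)
```

Prediction then amounts to evaluating this product for every candidate extension of the user's current path and keeping the most probable one.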
4.5.2 Reconsidering Prior Probabilities Computation
There are three approaches commonly used for assigning initial probabilities (priors) to the nodes of a Markov model. The first assigns equal probabilities to all nodes (pages). The second estimates the initial probability of a page p as the ratio of the number of visits to p as the first page in a path, to the total number of user sessions. In the case of modeling web navigational behavior, however, neither of these approaches provides accurate results. The first approach assumes a uniform distribution, favoring non-important web pages. The second does exactly the opposite: it favors only top-level entry pages. Furthermore, for a page that was never visited first, the prior probability equals zero. The third approach is more objective than the other two, since it assigns prior probabilities proportionally to the frequency of total visits to a page. This approach, however, does not handle important, yet new (i.e. not included in the web usage logs) pages. Finally, as shown in the experimental evaluation, all three approaches are very vulnerable to the training data used for building the predictive model.
In the literature, there exist only a few approaches in which the authors argue that these techniques are not accurate enough and define different priors. Sen and Hansen [SH03] use Dirichlet priors, whereas Borges and Levene [BL04] define a hybrid formula which combines the two options (taking into consideration the frequency of visits to a page as the first page, or the total number of visits to the page). For this purpose, they define a variable α, which ranges from 0 (for page requests as first page) to 1 (for total page requests). In their experimental study, however, they don't explicitly state the optimal value they used for α.
In this thesis, we address these shortcomings following an alternative approach. Our motivation draws from the fact that the initial probability of a page should reflect the importance of this page in the web navigation. We propose the integration of the web site's topological characteristics, as represented by its link structure, with the navigational patterns of its visitors, for computing these probabilities. More specifically, we propose the use of three PageRank-style ranking algorithms for assigning prior probabilities. The first (PR) is the PageRank algorithm applied on the web site's graph, and computes the page prior probabilities based solely on the link structure of the web site. The second is UPR, which, as already described, is applied on the web site's navigational graph and favors pages previously visited by many users. The third algorithm (SUPR) is a variation of UPR which assigns uniform probabilities to the random jump instead of biasing it as well.
Definition (PageRank-based Prior Probability): We define the prior probability P(x_i) of a page x_i as:

$P(x_i) = P^{(n)}(x_i) = (1-\epsilon)\,p(x_i) + \epsilon \sum_{x_k \in In(x_i)} P^{(n-1)}(x_k)\,p(x_k, x_i)$    (16)

with (1−ε) being the damping factor (usually set to 0.15), and for

(i) PR (PageRank):

$p(x_i) = \frac{1}{M} \quad \text{and} \quad p(x_k, x_i) = \frac{1}{|Out(x_k)|}$    (17)

(ii) SUPR:

$p(x_i) = \frac{1}{M} \quad \text{and} \quad p(x_k, x_i) = \frac{w_{ki}}{\sum_{x_j \in Out(x_k)} w_{kj}}$    (18)

(iii) UPR:

$p(x_i) = \frac{w_i}{\sum_{x_j \in WS} w_j} \quad \text{and} \quad p(x_k, x_i) = \frac{w_{ki}}{\sum_{x_j \in Out(x_k)} w_{kj}}$    (19)
Any of the aforementioned ranking schemes can be applied on the web site's web or navigational graph (or its synopsis), resulting in a probability assignment for each one of its pages. These probabilities can subsequently be used in place of the commonly used priors, addressing the aforementioned problems. As we present in the experimental study we performed, this approach provides more objective and precise predictions than those generated by the pure usage-based approaches.
In our experiments we used two publicly available data sets. The first one includes the page visits of users who visited the msnbc.com web site on 28/9/99 [MSN]. The visits are recorded at the level of URL category (for example sports, news, etc.). It includes visits to 17 categories (i.e. 17 distinct pageviews). We selected 96,000 distinct sessions including more than one and fewer than 50 page visits per session, and split them into two non-overlapping time windows to form a training (65,000 sessions) and a test (31,000 sessions) data set. The second data set includes the sessionized data for the DePaul University CTI web server, based on a random sample of users visiting the site for a two-week period during April 2002 [CTI]. The data set includes 683 distinct pageviews and 13,745 distinct user sessions of length more than one. We split the sessions into two non-overlapping time windows to form a training (9,745 sessions) and a test (4,000 sessions) data set. We will refer to these data sets as the msnbc and cti data sets respectively. We chose to use these two data sets since they present different characteristics in terms of web site context and number of pageviews5. More specifically,
msnbc records the visits to a very large portal. This means that the number of sessions, as well as the length of the paths, is very large. This data set, however, has very few pageviews, since the visits are recorded at the level of page categories. We expect the visits to this web site to be almost homogeneously distributed among the 17 different categories. On the other hand, the cti data set refers to an academic web site. Visits to such sites usually fall into two main groups: visits from students looking for information concerning courses or administrative material, and visits from researchers seeking information on papers, research projects, etc. We expect the recorded visits to reflect this categorization.
Since in all the experiments we created top-n rankings, in the evaluation step we used two metrics commonly used for comparing two top-n rankings r1 and r2. The first one, denoted OSim(r1, r2) [H02], indicates the degree of overlap between the top-n elements of the two sets A and B (each one of size n), defined as:

$OSim(r_1, r_2) = \frac{|A \cap B|}{n}$    (20)

The second, KSim(r1, r2), is based on Kendall's distance measure [KG90] and indicates the degree to which the relative orderings of two top-n lists are in agreement. It is defined as:

$KSim(r_1, r_2) = \frac{|\{(u,v) : r_1', r_2' \text{ agree on the ordering of } (u,v), u \neq v\}|}{|r_1 \cup r_2| \cdot (|r_1 \cup r_2| - 1)}$    (21)

where r1′ is an extension of r1, containing all elements included in r2 but not in r1, appended at the end of the list (r2′ is defined analogously) [H02]. In other words, KSim takes into consideration only the common items of the two lists, and computes how many pairs of them have the same relative ordering in both lists. Clearly, OSim is more important (especially for small rankings), since it indicates the concurrence of the predicted
5 We should note at this point that there does not exist any benchmark for web usage mining and personalization. We therefore chose these two publicly available data sets, which have been used in the past for experimentation in the web usage mining and personalization context.
pages with the actual visited ones. On the other hand, KSim must always be evaluated in conjunction with the respective OSim, since it can take high values even when only a few items are common to the two lists.
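The two metrics can be sketched as follows; KSim is implemented here per the description above (agreement over pairs of common items), which is a simplification of the full definition in [H02]:

```python
def osim(r1, r2):
    """Overlap of two top-n rankings (Equation 20)."""
    return len(set(r1) & set(r2)) / len(r1)

def ksim(r1, r2):
    """Fraction of pairs of common items ranked in the same relative order."""
    common = [x for x in r1 if x in r2]
    pairs = [(u, v) for i, u in enumerate(common) for v in common[i + 1:]]
    if not pairs:
        return 0.0
    agree = sum((r1.index(u) < r1.index(v)) == (r2.index(u) < r2.index(v))
                for u, v in pairs)
    return agree / len(pairs)

print(osim(["a", "b", "c"], ["a", "c", "d"]))  # 2 of 3 items overlap
print(ksim(["a", "b", "c"], ["a", "c", "d"]))  # a and c ordered the same way
```

The example also shows the caveat noted above: KSim can be 1.0 even when the overlap is partial, so it must be read together with OSim.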
At this point, we should discuss the methodology we chose for evaluating the generated recommendations. There exist several related research efforts that propose a general personalization architecture without supporting their work with any experimental evaluation [DM02, GKG05, ML+04, MPT99, OB+03, ZHH02b]. In this work we used a common methodology, dividing the data set into training and test sets. According to this evaluation methodology, the training data are used in order to generate the predictive model. The generated recommendations are in turn compared to the actual user paths, as derived from the test data, using various metrics [AG03, HLC+05, JPT03, JZM04b, JZM05, MD+00b, MPG03, NP03, SK+01]. Since, however, the recommendations are compared to paths that have already been followed by the users, it is questionable whether such a comparison evaluates the quality of recommendations that include new paths. This issue is partially addressed by most predictive models, since the generated recommendations include pages that are two or more steps away. In real-life systems, this problem is addressed when the predictive model is based on data extracted from an already personalized web site.
4.6.2 l-UPR Recommendations Evaluation
As already mentioned, the choice of the NG synopsis we use to model the user sessions is orthogonal to the l-UPR framework. In this Section, we present results regarding the impact of using our proposed method instead of pure usage-based probabilistic models, focusing on Markov Chains.
We used three different setups for generating recommendations. The first two, referred to as Start and Total, are the ones commonly used in Markov models for computing prior probabilities. More specifically, Total assigns prior page probabilities proportional to the total page visits, whereas Start assigns prior page probabilities proportional to the visits beginning with this page. The third setup, referred to as l-Upr, is in essence our proposed algorithm applied to a Markov Chain-based prNG. For the l-Upr setup, we set the damping factor (1−ε) to 0.15 and the number of iterations to 100 to ensure convergence. We expand each path to depth d = 2.
The experimental scenario is as follows: We select the 10 most popular paths comprising two or more pages from the test data set. For each such path p, we assume that it is the current path of the user and generate recommendations by applying the aforementioned approaches on the training data set. Using the first two setups, we find the n pages with the highest probability of being visited after p. Using our approach, on the other hand, we expand p to create a localized sub-graph and then apply l-UPR to rank the pages included in it. We then select the top-n ranked pages. This process results in three recommendation sets for each path p. At the same time, we identify, in the test data set, the n most frequent paths that extend p by one more page. We finally compare, for each path p, the generated top-n page recommendations of each method (Start, Total, l-Upr) with the n most frequent next pages, using the OSim and KSim metrics.
We ran the experiments generating top-3 and top-5 recommendation lists for each setup. We performed the experiments using small recommendation sets because this more closely resembles what happens in practice, i.e. the system recommends only a few next pages to the user. The diagrams presented here show the average OSim and KSim similarities over all 10 paths.
Figure 25 depicts the average OSim and KSim values for the top-3 and top-5 rankings generated for the msnbc data set. In the first case (top-3 page predictions) we observe that l-Upr behaves slightly worse in terms of prediction accuracy (OSim), but all methods achieve around 50% accuracy. The opposite is observed in the second case (top-5 page predictions), where l-Upr achieves better prediction accuracy than the other two methods, and the overall prediction accuracy is above average. In both cases we observe a lower KSim, from which we conclude that l-Upr managed to predict the next pages, but not in the same order as they were actually visited. As we mentioned earlier, however, the presentation order is not so important in such a small recommendation list. Overall, the differences between the three methods are insignificant. This can be justified if we take into account the nature of the data set used. As already mentioned, the number of distinct pageviews in this data set is very small, and therefore the probability of coinciding in the predictions is the same, irrespective of the method used.
[Figure: average OSim and KSim for the Start, Total, and l-UPR setups; top-3 and top-5 recommendations, msnbc data set]
Figure 25. Average OSim and KSim of top-n rankings for msnbc data set
To determine whether the number of distinct pageviews is what affects the prediction accuracy of the three methods, we performed the same experimental evaluation on the second data set, cti. Figure 26 depicts the average OSim and KSim values for the top-3 and top-5 rankings generated for the cti data set. We observe that in both cases l-UPR outperforms the other two methods, both in terms of prediction accuracy (OSim) and relative ordering (KSim). This finding supports our intuition that, in the case of large web sites with many pageviews, incorporating structural data in the prediction process enhances the accuracy of the recommendations.
[Figure: average OSim and KSim for the Start, Total, and l-UPR setups; top-3 and top-5 recommendations, cti data set]
Figure 26. Average OSim and KSim of top-n rankings for cti data set
Examining all findings together, we verify our claim that l-UPR performs the same as, or better than, commonly used probabilistic prediction methods. Even though the prediction accuracy in both experiments is around 50%, we should point out that this value represents the average OSim over 10 distinct top-n rankings.
OSim is lower for the two commonly used methods, Start and Total, whereas it exceeds 80% for the three proposed methods. KSim, on the other hand, exceeds 90% for all rankings in the case of our proposed methods, whereas it is high only for the first three rankings of the Start setup.
[Figure: OSim of the top-3, top-5, top-10, and top-20 rankings for the Start, Total, PR, SUPR, and UPR setups]
Figure 27. OSim for msnbc data set, Markov Chain NG synopsis
[Figure: KSim of the top-3, top-5, top-10, and top-20 rankings for the Start, Total, PR, SUPR, and UPR setups]
Figure 28. KSim for msnbc data set, Markov Chain NG synopsis
The diagrams of Figures 29 and 30 depict the OSim and KSim similarities for the top-3, top-5, top-10, and top-20 rankings of the cti data set. In this case, the rankings acquired by applying the two common methods did not match the actual visits at all, giving 0% OSim and KSim similarity. On the other hand, all three proposed methods reached an average of 80% OSim and 90% KSim in all setups, with SUPR slightly outperforming PR and UPR.
[Figure: OSim of the top-3, top-5, top-10, and top-20 rankings for the Start, Total, PR, SUPR, and UPR setups]
Figure 29. OSim for cti data set, Markov Chain NG synopsis
[Figure: KSim of the top-3, top-5, top-10, and top-20 rankings for the Start, Total, PR, SUPR, and UPR setups]
Figure 30. KSim for cti data set, Markov Chain NG synopsis
At this point, we should analyze the behavior of the Start and Total setups, which represent the straightforward Markov model implementation. The outcomes of the experiments verify our claim that Markov models are very vulnerable to the training data used, and several pages may be overestimated or underestimated in certain circumstances. In the case of the msnbc data set, where the number of distinct pages was very small and the navigational paths were therefore evenly distributed, the pure usage-based models seem to behave fairly well (but, again, worse than the hybrid models). On the other hand, in the case of the cti data set, where hundreds of distinct pages (and therefore distinct paths) existed, the prediction accuracy of the usage-based models was disappointing. We examined the produced top-n rankings of the two usage-based approaches and observed that they include only the visits of students to course material. Since many students probably visited the same pages and paths in that period of time, accessing the pages directly (probably using a bookmark), these visits overshadowed any other path visited by any other user. On the other hand, by taking into consideration the "objective" importance of a page, as conveyed by the link structure of the web site, such temporal influences are reduced. The reader may refer to Appendix B, which includes the top-10 ranked paths generated using the Start and Total setups, as well as the 10 most frequent ones, which were used as the test data set in our experiments.
The framework proposed in this Chapter can be directly applied for computing the prior probabilities of visiting the pages of a web site. In other words, it can be directly applied to Markov Chain NG synopses. In the case of higher-order Markov models, however, our intuition was that the framework should be extended to support the computation of prior probabilities for path visits (up to some length, depending on the order). For instance, a 2nd-order Markov model is based on the assumption that we have prior knowledge concerning the visit probabilities of all paths comprising up to 3 pages. Indeed, the results from applying the proposed algorithms to the cti data set indicated the need for this model extension. In the case of the msnbc data set, however, we did not observe any significant deviation in the results. This can be explained by the fact that msnbc has only a few distinct nodes, hence a small number of distinct paths a user can follow. As already mentioned, in this data set the users' visits were almost uniformly distributed across all of the web site's page categories. Therefore, the probability of visiting two pages consecutively is very well approximated by the probability of visiting the last page (almost independently of the page the user was previously visiting). In what follows, we present the results of this experiment. We present some preliminary ideas concerning possible extensions of this framework in the final Chapter.
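The need described above, i.e. probabilities over short paths rather than single pages, can be illustrated by treating each ordered page pair as a state and estimating transitions from session counts (an illustrative Python sketch, not the thesis implementation):

```python
from collections import defaultdict

def second_order_model(sessions):
    """Estimate P(next page | previous two pages) from sessions by
    treating each ordered page pair as a state."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sessions:
        for a, b, c in zip(s, s[1:], s[2:]):
            counts[(a, b)][c] += 1
    return {state: {page: n / sum(nxt.values()) for page, n in nxt.items()}
            for state, nxt in counts.items()}

m = second_order_model([["p1", "p2", "p3"],
                        ["p1", "p2", "p4"],
                        ["p1", "p2", "p3"]])
print(m[("p1", "p2")]["p3"])  # 2/3: p3 followed p1 -> p2 in two of three sessions
```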
The results of the set of experiments we performed using the 2nd-order Markov model as NG synopsis on the msnbc data set are included in the diagrams of Figures 31 and 32. We observe that in the case of 2nd-order Markov models the winner is UPR, followed by the SUPR and Total setups. A very interesting fact is that the pure link-based approach, PR, gives the worst results, having 0% OSim for the top-3 and top-5 rankings and only 20% OSim for the top-10 ranking. This can be explained by the fact that PR, which is in essence the application of the PageRank algorithm to the web site's graph, represents the steady-state vector of the Markov Chain as defined on the web graph. Therefore, when the web graph is modeled as an NG synopsis other than the Markov Chain, it is not as effective. On the other hand, the hybrid usage/link ranking algorithms outperform the two commonly used usage-based approaches in most cases.
[Figure: OSim of the top-3, top-5, top-10, and top-20 rankings for the Start, Total, PR, SUPR, and UPR setups]
Figure 31. OSim for msnbc data set, 2nd-order Markov model NG synopsis
[Figure: KSim of the top-3, top-5, top-10, and top-20 rankings for the Start, Total, PR, SUPR, and UPR setups]
Figure 32. KSim for msnbc data set, 2nd-order Markov model NG synopsis
Overall, comparing the three proposed methods, we observe that, for the msnbc data set, all methods have the same OSim when a Markov Chain synopsis is used, whereas UPR outperforms the other two when a 2nd-order Markov model synopsis is used. On the other hand, in the case of the cti data set, we observe that SUPR outperforms the other two methods. Nevertheless, there is no prevalent pattern relating the number of recommendations to OSim/KSim. Therefore, we cannot conclude on the superiority of any one of the proposed methods, other than that it strongly depends on both the data set and the NG synopsis used.
4.6.4 Comparison of l-UPR and h-PPM
In the last part of the experimental evaluation, we compared the two proposed frameworks, namely, l-UPR and h-PPM. For this purpose, we used the same methodology we followed when evaluating the l-UPR framework, as described in Section 4.6.2: we build the navigational graph from the test data set and select the 10 most popular paths comprising two or more pages. For each such path p, we assume that it is the current path of the user and generate recommendations by applying the aforementioned approaches to the training data set. Using l-UPR, we expand p to create a localized sub-graph and then apply the algorithm to rank the pages included in the sub-graph. Using h-PPM, we find the n pages having the highest probability of being visited after p. We then select the top-n ranked pages. At the same time, we identify in the test data set the n most frequent paths that extend p by one more page. We finally compare, for each path p, the generated top-n page recommendations of each method (l-UPR, h-PPM) with the n most frequent next pages, using the OSim metric. We omit the KSim results here, since, as already mentioned, they are not very important for such small recommendation sets.
We applied this methodology to both data sets, generating top-3 and top-5 recommendation sets. For generating recommendations using the h-PPM framework, we present here the variation that behaved best in the previous experiments, namely, UPR for the msnbc data set and SUPR for the cti data set. We should point out, however, that all variations produce almost the same recommendations for such small sets, as already implied in Section 4.6.3. The experimental results are depicted in Figure 33.
[Figure 33: average OSim of the top-3 and top-5 recommendations for l-UPR and h-PPM on the two data sets]
We observe that the relative prediction accuracy of each method depends on the size of the recommendation set and the data set that is used: h-PPM has better prediction accuracy for small recommendation sets, whereas l-UPR is slightly better for the bigger recommendation sets. Nevertheless, the differences between the two methods are minor, and we cannot draw any conclusions other than repeating that the final choice heavily depends on the data set we want to model.
always known. The system provides the web graph reconstruction module, which takes as input the web site's user sessions and reconstructs the web site's graph based on this information. The output of this process is an XML file.
Navigational Graph Creation (h-PPM): The proposed algorithm, UPR, is based on the application of link analysis over the web site's Navigational Graph, NG. This module enables the creation of the NG from the web site's user sessions, in order to be used in subsequent computations. The NG is stored in a hash file.
Prior Probabilities Computation (h-PPM): This module enables the computation of the prior probabilities defined in Section 4.5.2, namely, PR, SUPR, and UPR. The system also provides functionality for computing probabilities based on the page visit frequencies used by Markov models, namely, Start and Total (prior probabilities are proportional to the number of visits to a page as the first page in a session, or to the total number of visits to a page, respectively). This process takes as input the parameters of the chosen probability computation method. The prior probabilities computed using any one of the five alternative methods are stored in a hash table and used by the Path Probabilities Computation module. The results, as well as a log including all the iterations of the link analysis-based algorithms, are also saved in files. A screenshot of this module is shown in Figure 34.
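The two frequency-based priors, Start and Total, can be sketched in a few lines (a Python sketch assuming sessions are represented as lists of page identifiers; not the system's actual code):

```python
from collections import Counter

def start_total_priors(sessions):
    """`Start`: prior proportional to visits to a page as the first
    page of a session. `Total`: prior proportional to all visits to
    the page, in any position."""
    start = Counter(s[0] for s in sessions if s)
    total = Counter(p for s in sessions for p in s)
    n_start, n_total = sum(start.values()), sum(total.values())
    return ({p: c / n_start for p, c in start.items()},
            {p: c / n_total for p, c in total.items()})

start, total = start_total_priors([["a", "b"], ["a", "c"], ["b", "c"]])
print(start["a"])  # 2/3: "a" starts two of the three sessions
print(total["c"])  # 1/3: "c" accounts for two of the six page visits
```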
Path Probabilities Computation (h-PPM): The system enables popular path prediction using either the h-PPM framework or Markov models. The priors used in each method are pre-computed by the Prior Probabilities Computation module. This module enables the prediction of the n most probable next visits for any sub-path of the NG, along with the respective probabilities. It also enables the prediction of the top-n popular paths. This information is output to files for further analysis. Figure 35 includes a screenshot of this module.
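Under a first-order Markov Chain synopsis, the probability of a path and the top-n next-page prediction can be sketched as follows (the priors come from any of the five methods above; the function names and data shapes are our own, not the system's API):

```python
def path_probability(priors, trans, path):
    """P(path) under a first-order Markov chain: prior of the first
    page times the product of the transition probabilities along it."""
    p = priors.get(path[0], 0.0)
    for a, b in zip(path, path[1:]):
        p *= trans.get(a, {}).get(b, 0.0)
    return p

def top_next_pages(priors, trans, path, n=3):
    """Rank candidate next pages by the probability of the extended path."""
    scored = {c: path_probability(priors, trans, list(path) + [c])
              for c in trans.get(path[-1], {})}
    return sorted(scored, key=scored.get, reverse=True)[:n]

priors = {"a": 0.5, "b": 0.3, "c": 0.2}
trans = {"a": {"b": 0.6, "c": 0.4}, "b": {"c": 1.0}}
print(path_probability(priors, trans, ["a", "b", "c"]))  # 0.5 * 0.6 * 1.0 = 0.3
print(top_next_pages(priors, trans, ["a"]))  # ['b', 'c']
```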
l-UPR Path Prediction (l-UPR): This module implements the l-UPR recommendation framework. It incorporates functionality for creating the NG using the web site's user sessions, whereas the NG synopsis can be a Markov model of any order. The module takes as input the path (the current visitor's path) and the parameters of the l-UPR algorithm, and outputs the recommended paths in a file. A log file of the l-UPR iterations is also created. A screenshot of this module is included in Figure 36.
OSim-KSim (h-PPM & l-UPR): This module implements the two similarity measures
employed in our experimental study. This process takes as input two top-n lists and
outputs the respective OSim and KSim similarities.
4.8 Conclusions
There is a wealth of recommendation models for personalizing a web site based on the navigational behavior of past users. Most of these models, however, are solely based on the web site's usage data, ignoring the link structure of the web graph visited. In this Chapter we presented how link analysis can be integrated in the web personalization process. We proposed a novel algorithm, UPR, which is applicable to any navigational graph synopsis and provides ranked recommendations to the visitors of a web site, capitalizing on the structural properties of the navigation graph. We presented UPR in the context of two different personalization frameworks, l-UPR and h-PPM. In the first framework, a localized version of UPR is applied to a personalized sub-graph of the NG synopsis and is used to create online personalized recommendations for the visitors of the web site. The second approach addresses several shortcomings of pure usage-based probabilistic predictive models by incorporating link analysis techniques in such models in order to support popular path prediction. The experimental results for both frameworks are more than promising, with the proposed methods outperforming existing approaches.
CHAPTER 5
Conclusions and Future Research
5.1 Thesis Summary
The World Wide Web grows at a tremendous pace, and its impact as the main source of information acquisition is increasing dramatically. Because of its rapid and chaotic growth, the resulting network of information lacks organization and structure, making web site exploration difficult. To address the need for effective web navigation, web sites provide personalized recommendations to their end users.
Most of the research efforts in web personalization build on the extensive research in web usage mining, i.e. the exploitation of the navigational patterns of a web site's visitors. When a personalization system relies only on usage-based results, however, valuable information conceptually related to what is finally recommended may be missed. Moreover, the structural properties of the web site are often disregarded. In this thesis, we presented novel techniques that incorporate the content semantics and the structural properties of a web site in the web personalization process.
In the first part of our work we present a semantic web personalization system.
Motivated by the fact that if a personalization system is only based on the recorded
navigational patterns, important information that is semantically similar to what is
recommended might be missed, we propose a web personalization framework (SEWeP)
that integrates usage data with content semantics, expressed in ontology terms, in order to
5.2 Discussion
SEWeP was introduced at a time when the Semantic Web vision was rather new, and people were just starting to exploit some of its principal ideas, structures, languages, and protocols. Since then, some of these were shown to be insufficient and were abandoned, whereas others, ontologies being one of them, are now broadly accepted and used in many different applications. SEWeP exploits ontologies in order to represent web content and the users' navigational behavior. The exploitation of content semantics (whether using ontologies or not) in the web usage mining or web personalization process has been the subject of many studies that followed (or were carried out in parallel to) SEWeP [AG+03, OB+03, ML+04, MSR04, GKG05].
There exist, however, several open issues in this area. Since most existing web sites do not have an underlying semantic infrastructure, and due to the size of most web sites, it is very difficult to annotate the content by hand, as most approaches assume. It is evident that the content characterization process should be performed automatically. One of the most crucial parts of this process is the mapping of the extracted features to ontology terms, since an inappropriate mapping would eventually result in inaccurate recommendations. Therefore, one of the most important updates to the SEWeP system would be to incorporate even more effective semantic similarity measures, such as those proposed in [MTV04, MT+05].
In this thesis we have supported, through an extensive experimental evaluation process, the initial intuition that link analysis can be used in several different contexts in order to support web personalization. As we have already pointed out, the priors defined in the h-PPM framework are directly applicable to Markov Chains, but do not always work for higher-order Markov models. UPR and its variations compute probabilities for the navigational graph's nodes, i.e. the web site's pages. In higher-order Markov models, we need such probabilities for the web site's paths too. One solution would be to create summarizing nodes representing all the paths, or only the most popular ones, and then apply UPR on this aggregate navigational graph. This would result in UPR values for paths, which could subsequently be used in the h-PPM context. This issue remains open for future work.
Our future plans involve the application of l-UPR to different NG synopses. As shown in the experimental evaluation, l-UPR is a very promising recommendation algorithm. In our study we applied it to the Markov Chain NG synopsis. We expect better results in the case of more complex NG synopses, which approximate the navigational graph more accurately.
Another issue that should be taken into consideration when assigning importance scores to the pages of a web site is the "freshness" of, and trends in, the web navigation context. We believe that pages/paths with more recent visits, or an increasing rate of visits, should be favored in the recommendation process [BVW04], and we aim to incorporate this intuition in our future work.
Moreover, we plan to investigate how this hybrid usage-structure ranking can be applied to a unified web/navigational graph which extends beyond the limits of a single web site. Such an approach would enable a global importance ranking over the web, enhancing both web search results and the recommendation process.
Finally, we conclude with our vision for web personalization systems. This thesis has shown how the integration of content semantics or link analysis techniques can improve the recommendation process. We believe that the next step would be to exploit all these data, namely, usage, content, and structure, in a single, unified framework. This framework need not be specifically focused on generating personalized recommendations, but should cover many web applications, such as web search result ranking or web content (blogs/bookmarked pages/multimedia etc.) categorization.
LIST OF REFERENCES
[ADW02] C. Anderson, P. Domingos, D. S. Weld, Relational Markov Models and their
Application to Adaptive Web Navigation, in Proc. of the 8th ACM SIGKDD
Conference, Canada (2002)
[AG03] S. Acharyya, J. Ghosh, Context-Sensitive Modeling of Web Surfing Behaviour
Using Concept Trees, in Proc. of the 5th WEBKDD Workshop, Washington DC (2003)
[ANM04] M.S. Aktas, M.A. Nacar, F. Menczer, Personalizing PageRank Based on
Domain Profiles, in Proc. of the 6th WEBKDD Workshop, Seattle (2004)
[AP+04] M. Albanese, A. Picariello, C. Sansone, L. Sansone, A Web Personalization
System based on Web Usage Mining Techniques, in Proc. of WWW2004, New York
(2004)
[AS94] R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules, in Proc. of
20th VLDB Conference (1994)
[B02] B. Berendt, Using site semantics to analyze, visualize and support navigation, in
Data Mining and Knowledge Discovery Journal, 6: 37-59 (2002)
[BB+99] A.G. Buchner, M. Baumgarten, S.S. Anand, M.D. Mulvenna, J.G. Hughes,
Navigation pattern discovery from Internet data, in Proc. of the 1st WEBKDD
Workshop, San Diego (1999)
[BG+04] V. Bacarella, F. Giannotti, M. Nanni, D. Pedreschi, Discovery of Ads Web Hosts through Traffic Data Analysis, in Proc. of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 04), Paris, France (2004)
[BHS02] B. Berendt, A. Hotho, G. Stumme, Towards Semantic Web Mining, in Proc. of
the 1st Intl. Semantic Web Conference (ISWC 2002)
[BL99] J. Borges, M. Levene, Data Mining of User Navigation Patterns, in Web Usage Analysis and User Profiling, Springer-Verlag Lecture Notes in Computer Science, 1836: 92-111
[BL04] J. Borges, M. Levene, A Dynamic Clustering-Based Markov Model for Web
Usage Mining, Technical Report, available at http://xxx.arxiv.org/abs/cs.IR/0406032
(2004)
[BL06] J. Borges, M. Levene, Ranking Pages by Topology and Popularity within Web
Sites, accepted for publication in World Wide Web Journal (2006)
[BP98] S. Brin, L. Page, The anatomy of a large-scale hypertextual Web search engine,
Computer Networks, 30(1-7): 107-117 (1998)
[BS00] B. Berendt, M. Spiliopoulou, Analysing navigation behaviour in web sites
integrating multiple information systems, The VLDB Journal 9(1):56-75 (2000)
[BS04] R. Baraglia, F. Silvestri, An Online Recommender System for Large Web Sites, in
Proc. of ACM/IEEE Web Intelligence Conference (WI04), China (2004)
96
97
98
[HN+01] Z. Huang, J. Ng, D.W. Cheung, M.K. Ng, W.-K. Ching, A Cube Model for Web Access Sessions and Cluster Analysis, in Proc. of the 3rd WEBKDD Workshop (2001)
[HN+03] M. Halkidi, B. Nguyen, I. Varlamis, M. Vazirgiannis, THESUS: Organizing Web Documents into Thematic Subsets using an Ontology, VLDB Journal, 12(4): 320-332 (2003)
[JF+97] T. Joachims, D. Freitag, T. Mitchell, WebWatcher: A Tour Guide for the World
Wide Web, in Proc. of IJCAI97, (1997)
[JPT03] S. Jespersen, T.B. Pedersen, J. Thorhauge, Evaluating the Markov Assumption for Web Usage Mining, in Proc. of ACM WIDM 2003, Louisiana, (2003)
[JZM04a] X. Jin, Y. Zhou, B. Mobasher, Web Usage Mining Based on Probabilistic
Latent Semantic Analysis, in Proc. of the 10th ACM SIGKDD Conference on
Knowledge Discovery and Data Mining (KDD'04), Seattle, (2004)
[JZM04b] X. Jin, Y. Zhou, B. Mobasher, A Unified Approach to Personalization based
on Probabilistic Latent Semantic Models of Web usage and Content, in Proceedings of
AAAI Workshop on Semantic Web Personalization (SWP04), (2004)
[JZM05] X. Jin, Y. Zhou, B. Mobasher, A Maximum Entropy Web Recommendation
System: Combining Collaborative and Content Features, in Proceedings of the ACM
SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'05), Chicago
(2005)
[Kij97] M. Kijima, Markov Processes for Stochastic Modeling, Chapman & Hall,
London, (1997)
[KS04] J. Kleinberg, M. Sandler, Using Mixture Models for Collaborative Filtering, in
Proc. of ACM Symposium on Theory of Computing (STOC04), (2004)
[KG90] M. Kendall, J.D. Gibbons, Rank Correlation Methods, Oxford University Press, (1990)
[KHG03] S.D. Kamvar, T.H. Haveliwala, and G.H. Golub, Adaptive Methods for the
Computation of PageRank, in Proc. of the Intl. Conference on the Numerical Solution
of Markov Chains, (2003)
[KH+03] S.D. Kamvar, T.H. Haveliwala, C.D. Manning, and G.H. Golub, Extrapolation
Methods for Accelerating PageRank Computations, in Proc. of the 12th Intl. World
Wide Web Conference, (2003)
[KJ+01] R. Krishnapuram, A. Joshi, O. Nasraoui, L. Yi, Low-Complexity Fuzzy Relational Clustering Algorithms for Web Mining, in IEEE Transactions on Fuzzy Systems, (2001)
[LE+04] C. Lampos, M. Eirinaki, D. Jevtuchova, M. Vazirgiannis, Archiving the Greek
Web, in Proc. of the 4th Intl. Web Archiving Workshop (IWAW04), Bath, UK (2004)
[LL03] M. Levene, G. Loizou, Computing the Entropy of User Navigation in the Web, in
Intl. Journal of Information Technology and Decision Making, 2: 459-476, (2003)
[MD+00a] B. Mobasher, H. Dai, T. Luo, Y. Sung, J. Zhu, Discovery of Aggregate Usage
Profiles for Web Personalization, in Proc. of 2nd WEBKDD Workshop, Boston (2000)
99
[MD+00b] B. Mobasher, H. Dai, T. Luo, Y. Sung, J. Zhu, Integrating web usage and
content mining for more effective personalization, in Proc. of the Intl. Conference on
Ecommerce and Web Technologies (ECWeb), Greenwich, UK (2000)
[ML+04] R. Meo, P.L. Lanzi, M. Matera, R. Esposito, Integrating Web Conceptual
Modeling and Web Usage Mining, in Proc. of the 6th WEBKDD Workshop, Seattle
(2004)
[MPG03] E. Manavoglu, D. Pavlov, C.L. Giles, Probabilistic User Behaviour Models, in
Proc. of the 3rd Intl. Conference on Data Mining (ICDM 2003)
[MPT99] F. Masseglia, P. Poncelet, M. Teisseire, Using Data Mining Techniques on Web
Access Logs to Dynamically Improve Hypertext Structure, in ACM SigWeb Letters,
8(3):13-19, (1999)
[MR95] R. Motwani and P. Raghavan. Randomized Algorithms, Cambridge University
Press, United Kingdom (1995)
[MSN] msnbc.com Web Log Data, available from UCI KDD Archive,
http://kdd.ics.uci.edu/databases/msnbc/msnbc.html
[MSR04] S.E. Middleton, N.R. Shadbolt, D.C. De Roure, Ontological User Profiling in Recommender Systems, ACM Transactions on Information Systems (TOIS), 22(1): 54-88 (2004)
[MTV04] D. Mavroeidis, G. Tsatsaronis, M. Vazirgiannis, Semantic Distances for Sets of
Senses and Applications in Word Sense Disambiguation, in Proc. of the 3rd Intl.
Workshop on Text Mining and its Applications, Athens, Greece (2004)
[MT+05] D. Mavroeidis, G. Tsatsaronis, M. Vazirgiannis, M. Theobald, G. Weikum,
Word Sense Disambiguation for Exploiting Hierarchical Thesauri in Text
Classification, in Proc. of the ECML/PKDD 2005 Conference, Porto, Portugal (2005)
[NC+03] O. Nasraoui, C. Cardona, C. Rojas, F. Gonzales, Mining Evolving User Profiles
in Noisy Web Clickstream Data with a Scalable Immune System Clustering Algorithm,
in Proc. of the 5th WEBKDD Workshop, Washington DC (2003)
[NM02] A. Nanopoulos, Y. Manolopoulos, Efficient Similarity Search for Market Basket
Data, in the VLDB Journal, (2002)
[NM03] M. Nakagawa, B. Mobasher, A Hybrid Web Personalization Model Based on
Site Connectivity, in Proc. of the 5th WEBKDD Workshop, Washington DC (2003)
[NP03] O. Nasraoui, C. Petenes, Combining Web Usage Mining and Fuzzy Inference for
Website Personalization, in Proc. of the 5th WEBKDD Workshop, Washington DC
(2003)
[NP04] O. Nasraoui, M. Pavuluri, Complete this Puzzle: A Connectionist Approach to
Accurate Web Recommendations based on a Committee of Predictors, in Proc. of the
6th WEBKDD Workshop, Seattle, (2004)
[OB+03] D. Oberle, B. Berendt, A. Hotho, J. Gonzalez, Conceptual User Tracking, in
Proc. of the 1st Atlantic Web Intelligence Conf. (AWIC), (2003)
[P80] M. F. Porter, An algorithm for suffix stripping, Program, 14(3):130-137, (1980)
APPENDIX A
In order to evaluate the usefulness of the SEWeP framework, we presented the users with three paths, each having a different objective: one (A) containing visits to contextually irrelevant pages (random surfer), a second (B) comprising a short path to very specialized pages (information-seeking visitor), and a third (C) including visits to top-level, yet research-oriented pages (topic-oriented visitor). This is the 2nd blind test that was presented to the users, evaluating the usability of original vs. hybrid recommendations. Note that we presented unlabeled recommendation sets to the users.
Path A
http://www.db-net.aueb.gr/people.htm
http://www.db-net.aueb.gr/links.htm
http://www.db-net.aueb.gr/courses/courses.htm ()
Recommendations
A.1 (HYBRID)
http://www.db-net.aueb.gr/pubs.php
http://www.db-net.aueb.gr/research.htm
http://www.db-net.aueb.gr/courses/postgrdb/asilomar.html
A.2 (ORIGINAL)
http://www.db-net.aueb.gr/pubs.php
http://www.db-net.aueb.gr/pubsearch.php
http://www.db-net.aueb.gr/research.htm
Path B
http://www.db-net.aueb.gr/people/michalis.htm
http://www.db-net.aueb.gr/mhalk/CV_maria.htm ()
Recommendations
B.1 (HYBRID)
http://www.db-net.aueb.gr/mhalk/Publ_maria.htm
http://www.db-net.aueb.gr/research.htm
http://www.db-net.aueb.gr/magda/papers/webmining_survey.pdf
B.2 (ORIGINAL)
http://www.db-net.aueb.gr/mhalk/Publ_maria.htm
http://www.db-net.aueb.gr/papers/gr_book/Init_frame.htm
http://www.db-net.aueb.gr/papers/gr_book/Contents.htm
Path C
http://www.db-net.aueb.gr/index.php
http://www.db-net.aueb.gr/research.htm
http://www.db-net.aueb.gr/people.htm ()
Recommendations
C.1 (ORIGINAL)
http://www.db-net.aueb.gr/projects.htm
http://www.db-net.aueb.gr/courses/courses.htm
http://www.db-net.aueb.gr/courses/courses.php?ancid=dm
C.2 (HYBRID)
http://www.db-net.aueb.gr/projects.htm
http://www.db-net.aueb.gr/courses/courses.htm
http://www.db-net.aueb.gr/courses/POSTGRDB/ballp.pdf
APPENDIX B
We present here the top-10 ranked paths generated using the Start and Total setups (in Tables 6 and 7 respectively), as well as the 10 most frequent ones (in Table 5), extracted from the test data set we used in our experiments for the h-PPM framework. We observe that the rankings of the first two approaches represent the visits of students to course material. We assume that in the period of time when the data set was collected, many students visited the same pages and paths, accessing them directly (probably via a bookmarked page). Therefore, their visits dominated any other path visited by any other user. On the other hand, by taking into consideration the "objective" importance of a page, as denoted by the link structure of the web site, such temporal influence is reduced. We omit the top-10 ranked paths generated using the PR, SUPR, and UPR algorithms, since they are very similar to the ranking of the frequent paths, as shown by the experimental results (Figure 28).