
Magdalini P. Eirinaki

NEW APPROACHES
TO WEB PERSONALIZATION
Ph.D. THESIS

ATHENS UNIVERSITY OF ECONOMICS AND BUSINESS


Dept. of Informatics

May 2006

Copyright by Magdalini P. Eirinaki, 2006


ACKNOWLEDGEMENTS
"...And if you find her poor, Ithaka won't have fooled you. Wise as you will have
become, so full of experience, you will have understood by then what these Ithakas
mean." (Constantine P. Cavafis, 1863-1933)
There are many people that I need to thank for making this long journey so
memorable. First of all, I thank my advisor, Michalis Vazirgiannis, for believing in me
and supporting me all these years, providing me with valuable advice, and giving me the
opportunity to travel to several places and meet very interesting people during project
meetings or conferences.
I would also like to thank the members of my PhD examination committee, namely,
professors Ioannis Milis, Gerhard Weikum, Emmanouil Yakoumakis, Emmanouil
Yannakoudakis, Martha Sideri and Vassilis Vassalos.
I would like to extend my sincerest thanks to my collaborators during my PhD. First
of all, Iraklis Varlamis, for our fruitful discussions that constituted my first steps in
research. Also, Giorgos Tsatsaronis, Dimitris Kapogiannis, and especially Charalampos
Lampos and Stratos Pavlakis, that worked really hard as undergraduate students offering
their excellent implementation skills, as well as valuable insights concerning our work. I
also thank Sarabjot S. Anand and Joannis Vlachakis, my collaborators during a European
project.
As a member of the DB-NET group, I had the chance to meet and befriend many
people. My thanks go to Maria Halkidi, Yannis Batistakis, Christos Pateritsas, Euripides
Vrachnos, Christoforos Ververidis, Christos Doulkeridis, Giorgos Tsatsaronis, Dimitris
Mavroeidis, our wonderful secretary Viky Sambani, and my good friends, Iraklis
Varlamis and Stratis Valavanis, for making these years fun and carefree, even during our
numerous moves, or our deadlines.
A friend is one who believes in you when you have ceased to believe in yourself.
There are many times during one's PhD when one wants to give up. My gratitude goes to
all my friends (thankfully too many to be mentioned individually), especially Elena
Avatagelou, Nikos Karelos, Matoula Kalyveza, my brother Pavlos Eirinakis and my very
best friend, Foteini Glykou, for being there for me.
I should thank the person that motivated me to become a computer scientist, my
uncle, professor Panagiotis Varelas. Throughout the years, he was always challenging me
with brain-teasing mathematical problems, introducing me to the fascinating world of
logic, algorithms, and, eventually, Informatics.
Special thanks to Alkis Polyzotis. He is the one that inspired and motivated me to
start this journey. His insights and advice during all these years enabled me to set higher
standards for my research. He has been a true friend and mentor, and I am very happy
that we have started a new journey together.
Finally, I come to the ones I thank the most for their constant love, support, and
encouragement, my parents Kyriaki and Pantelis Eirinakis. They believe in me and
always do everything in their power to let me pursue my dreams. I owe to them
everything that I have accomplished to this day. This thesis is dedicated to them.


TABLE OF CONTENTS
LIST OF FIGURES ........................................................................................................ vii
LIST OF TABLES ........................................................................................................... ix
ABSTRACT...................................................................................................................... xi
1 Introduction.................................................................................................................... 1
1.2 Contributions........................................................................................................ 6
1.3 Thesis Outline ...................................................................................................... 9
2 Preliminaries & Related Work ................................................................................... 11
2.1 Usage Data Pre-processing ................................................................................ 11
2.2 Web Usage Mining and Personalization............................................................ 14
2.3 Integrating Content Semantics in Web Personalization..................................... 15
2.4 Integrating Structure in Web Personalization .................................................... 16
3 Semantic Web Personalization ................................................................................... 19
3.1 Motivating Example........................................................................................... 20
3.2 SEWeP System Architecture ............................................................................. 23
3.3 Similarity of Ontology Terms............................................................................ 25
3.3.2 THESUS Similarity Measure.................................................................... 26
3.4 Content Characterization ................................................................................... 26
3.4.1 Keyword Extraction .................................................................................. 27
3.4.2 Keyword Translation ................................................................................ 28
3.4.3 Semantic Characterization ........................................................................ 30
3.5 C-Logs Creation & Mining................................................................................ 32
3.6 Document Clustering ......................................................................................... 32
3.7 Recommendation Engine ................................................................................... 33
3.7.1 Semantic Recommendations..................................................................... 34

3.7.2 Category-based Recommendations........................................................... 35


3.8 Experimental Evaluation.................................................................................... 36
3.8.1 Methodology ............................................................................................. 37
3.8.2 Experimental Results ................................................................................ 38
3.9 System Prototypes.............................................................................................. 42
3.9.1 SEWeP Prototype...................................................................................... 42
3.9.2 The I-KnowUMine Project ....................................................................... 46
3.9.3 The Greek Web Archiving Project ........................................................... 47
3.10 Conclusions...................................................................................................... 49
4 Link Analysis for Web Personalization ..................................................................... 51
4.1 Motivating Example........................................................................................... 52
4.2 Preliminaries ...................................................................................................... 54
4.2.1 The Navigational Graph............................................................................ 54
4.2.2 Markov Models......................................................................................... 57
4.3 Usage-based PageRank...................................................................................... 60
4.3.1 PageRank .................................................................................................. 60
4.3.2 UPR: Link Analysis on the Navigational Graph ...................................... 62
4.4 Localized UPR (l-UPR) ..................................................................................... 64
4.4.1 The Personalized Navigational Graph (prNG) ......................................... 65
4.4.2 UPR-based Personalized Recommendations ............................................ 68
4.5 Web Path Prediction using hybrid Probabilistic Predictive Models.................. 68
4.5.1 Popular Path Prediction............................................................................. 69
4.5.2 Reconsidering Prior Probabilities Computation ...................................... 70
4.6 Experimental Evaluation.................................................................................... 72
4.6.1 Experimental Setup................................................................................... 72
4.6.2 l-UPR Recommendations Evaluation...................................................... 74
4.6.3 h-PPM Recommendations Evaluation..................................................... 77
4.6.4 Comparison of l-UPR and h-PPM ............................................................ 82
4.7 System Prototype ............................................................................................... 83
4.8 Conclusions........................................................................................................ 87


5 Conclusions and Future Research.............................................................................. 88


5.1 Thesis Summary................................................................................................. 88
5.2 Discussion .......................................................................................................... 90
LIST OF REFERENCES ............................................................................................... 95
APPENDIX A ................................................................................................................ 103
APPENDIX B ................................................................................................................ 105


LIST OF FIGURES
Figure 1. The web personalization process......................................................................... 3
Figure 2. SEWeP architecture........................................................................................... 24
Figure 3. The keyword translation procedure ................................................................... 30
Figure 4. The semantic characterization process .............................................................. 31
Figure 5. The semantic recommendation method............................................................. 35
Figure 6. The category-based recommendation method .................................................. 36
Figure 7. Experiment #1: Recommendation sets evaluation ........................................... 39
Figure 8. Experiment #2: Original vs. Hybrid Recommendations................................... 40
Figure 9. Experiment #3: Semantic vs. Hybrid Recommendations .................................. 41
Figure 10. Experiment #4: Category-based vs. Hybrid Recommendations ..................... 41
Figure 11. SEWeP screenshot: The Logs Preprocessing module ..................................... 44
Figure 12. SEWeP screenshot: the Session Management module.................................... 45
Figure 13. SEWeP screenshot: the Semantic Association Rules Mining module ............ 45
Figure 14. The IKUM system architecture ....................................................................... 46
Figure 15. The Greek Web Archiving system architecture .............................................. 48
Figure 16. PageRank-based example................................................................................ 53
Figure 17. Usage-based PageRank (UPR) example ......................................................... 53
Figure 18. NG Creation Algorithm ................................................................................... 56
Figure 19. Navigational Graph ......................................................................................... 57
Figure 20. NG synopsis (Markov Chain) .......................................................................... 59
Figure 21. prNG of Markov Chain NG synopsis .............................................................. 66
Figure 22. prNG of 2nd order Markov model NG synopsis ............................................. 66
Figure 23. Construction of prNG ...................................................................................... 67
Figure 24. Path expansion subroutine............................................................................... 67
Figure 25. Average OSim and KSim of top-n rankings for msnbc data set....................... 76
Figure 26. Average OSim and KSim of top-n rankings for cti data set............................. 76
Figure 27. OSim for msnbc data set, Markov Chain NG synopsis ................................... 78
Figure 28. KSim for msnbc data set, Markov Chain NG synopsis ................................... 78
Figure 29. OSim for cti data set, Markov Chain NG synopsis.......................................... 79


Figure 30. KSim for cti data set, Markov Chain NG synopsis.......................................... 79
Figure 31. OSim for msnbc data set, 2nd-order Markov model NG Synopsis ................... 81
Figure 32. KSim for msnbc data set, 2nd-order Markov model NG Synopsis ................... 81
Figure 33. Comparison of l-UPR and h-PPM, Markov Chain NG synopsis .................... 83
Figure 34. The Prior Probabilities Computation module.................................................. 85
Figure 35. The Path Probabilities Computation module................................................... 86
Figure 36. The l-UPR Path Prediction module ................................................................. 86


LIST OF TABLES
Table 1: Related Work ...................................................................................................... 18
Table 2. URIs and related concept hierarchy terms.......................................................... 22
Table 3. User Sessions ...................................................................................................... 56
Table 4. Path Frequencies ................................................................................................ 59
Table 5. Top-10 Frequent Paths...................................................................................... 105
Table 6. Top-10 ranking for Start setup.......................................................................... 106
Table 7. Top-10 ranking for Total setup......................................................................... 106


ABSTRACT
The impact of the World Wide Web as a main source of information acquisition is
increasing dramatically. The existence of such abundance of information, in combination
with the dynamic and heterogeneous nature of the web, makes web site exploration a
difficult process for the average user. To address the requirement of effective web
navigation, web sites provide personalized recommendations to the end users. Most of the
research efforts in web personalization correspond to the evolution of extensive research
in web usage mining, i.e. the exploitation of the navigational patterns of the web site's
visitors. When a personalization system relies solely on usage-based results, however,
valuable information conceptually related to what is finally recommended may be
missed. Moreover, the structural properties of the web site are often disregarded.
In this thesis, we propose novel techniques that use the content semantics and the
structural properties of a web site in order to improve the effectiveness of web
personalization. In the first part of our work we present SEWeP (standing for Semantic
Enhancement for Web Personalization), a personalization system that integrates usage data with content
semantics, expressed in ontology terms, in order to compute semantically enhanced
navigational patterns and effectively generate useful recommendations. To the best of our
knowledge, SEWeP is the only semantic web personalization system that may be used by
non-semantic web sites.
In the second part of our work, we present a novel approach for enhancing the quality
of recommendations based on the underlying structure of a web site. We introduce UPR
(Usage-based PageRank), a PageRank-style algorithm that relies on the recorded usage
data and link analysis techniques. UPR is applied on an abstraction of the user sessions
termed Navigational Graph in order to determine the importance of a web page. We
develop l-UPR, a recommendation algorithm based on a localized variant of UPR that is
applied to the personalized navigational sub-graph of each user. Moreover, we integrate
UPR and its variations in a hybrid probabilistic predictive model as a robust mechanism
for determining prior probabilities of page visits. Overall, we demonstrate that our
proposed hybrid personalization framework results in more objective and representative
predictions than existing techniques that rely solely on usage data.

CHAPTER 1
Introduction
During the past few years the World Wide Web has become the biggest and most
popular way of communication and information dissemination. It serves as a platform for
exchanging various kinds of information, ranging from research papers and educational
content, to multimedia content, software and personal logs (blogs). Every day, the web
grows by roughly a million electronic pages, adding to the hundreds of millions of pages
already on-line. Because of its rapid and chaotic growth, the resulting network of
information lacks organization and structure. Users often feel disoriented and get lost
in that information overload that continues to expand. On the other hand, the e-business
sector is rapidly evolving and the need for web market places that anticipate the needs of
their customers is more than ever evident. Therefore, the key requirement nowadays is
that of predicting user needs in order to improve the usability and user retention of a web
site. This thesis presents novel methods and techniques that address this requirement. We
elaborate on the problems that motivated our work in Section 1.1, and we outline our
contribution in Section 1.2. An outline of this thesis is given in Section 1.3.

1.1 Motivation
Imagine a user that navigates through the pages of a web portal, specializing in sports.
We will refer to it (hypothetically) as the Sportal, residing on the imaginary site
www.theSportal.com. This user is a fan of winter skiing and would like to visit a ski resort

for the holidays. He therefore searches to find any related information available, ranging
from winter resort hotels to weather reports and ski equipment. Since the amount of
information in the Sportal is vast, this information is not necessarily organized as a
single thematic module. Based on this user's navigation, however, in combination with
previous users' visits focusing on the same subject (winter ski vacation), the system
makes recommendations to the user.
Assume, for example, that many users in the past have seen the pages
www.theSportal.com/events/ski.html, www.theSportal.com/travel/ski_resorts.html, and
www.theSportal.com/equipment/ski_boots.html during the same visit. If the current user
visits the first two, the system can recommend the third one, based on the assumption that
people with similar interests present similar navigational behavior. Moreover, since the
current visitor seems to be interested in pages concerning the winter, ski, and resorts
thematic areas of the portal, the system may recommend other pages that are related with
these categories, such as a page about ski equipment that is on sale
(www.theSportal.com/equipment/ski_boot_sale.html) or a page about hotels suited for
winter holidays (www.theSportal.com/travel/winter/hotels.html). Finally, we also note
that some pages of the Sportal are more important than others in terms of previous
users' navigation and their position in the web site's graph. Consider, for example, a page
that provides information on the snow and weather conditions on all the ski resorts
(www.theSportal.com/weather/snowreport.html). If many users have visited this page
before, either by following links from other pages, or by directly jumping to it (using a
bookmark, for example), then this page seems important, and can also be recommended
to the current user. What is more, if the Sportal has registered users, and therefore can
explicitly collect information concerning their interests and topic preferences, it can
provide related recommendations to each individual visitor.
In brief, web personalization can be defined as any action that customizes the
information or services provided by a web site to an individual user, or a set of users,
based on knowledge acquired from their navigational behavior, recorded in the web site's
logs; in other words, its usage. This information is often combined with the content and
the structure of the web site, as well as the interests/preferences of the user, if they are

available. The web personalization process is illustrated in Figure 1. Using the four
aforementioned sources of information as input to pattern discovery techniques, the
system tailors the provided content to the needs of each visitor of the web site. The
personalization process can result in the dynamic generation of recommendations, the
creation of index pages, the highlighting of existing hyperlinks, the publishing of targeted
advertisements or emails, etc. In this thesis we focus on personalization systems that aim
at providing personalized recommendations to the web site's visitors. Furthermore, since
the personalization algorithms we propose in this work are generic and applicable to any
web site, we assume that no explicit knowledge involving the users' profiles, such as
ratings or demographic information, is available.

Figure 1. The web personalization process


The problem of providing recommendations to the visitors of a web site has received
a significant amount of attention in the related literature. Most of the research efforts in
web personalization correspond to the evolution of extensive research in web usage
mining, taking into consideration only the navigational behavior of the (anonymous or
registered) visitors of the web site [ADW02, AP+04, BS04, HEK03, JF+97, KS04,
MD+00a, MPG03, MPT99, NM02, NP03, NP04, SK+00, SK+01]. Pure usage-based

personalization, however, presents certain shortcomings. This may happen when, for
instance, there is not enough usage data available in order to extract patterns related to
certain navigational actions, or when the web site's content changes and new pages are
added but are not yet included in the web logs. Moreover, given the temporal
characteristics of the web in terms of its usage, such systems are very sensitive to the
training data used to construct the predictive model. As a result, a
number of research approaches integrate other sources of information, such as the web
content [AG03, DM02, EGP02, GKG05, JZM04b, JZM05, MD+00b, ML+04, MSR04,
OB+03, PE00] or the web structure [BL06, HLC05, NM03, ZHH02b] in order to enhance
the web personalization process.
As already implied, users' navigation is largely driven by semantics. In other
words, in each visit, the user usually aims at finding information concerning a particular
subject. Therefore, the underlying content semantics should be a dominant factor in the
process of web personalization. The web site's content characterization process involves
extracting features from the web pages. Usually these features are keywords,
subsequently used to retrieve similarly characterized content. Several methods for
extracting keywords that characterize web content have been proposed [BP98, CD+99,
HG+02]. The similarity between documents is usually based on exact matching between
these terms. This way, however, only a binary matching between documents is achieved,
whereas no actual semantic similarity is taken into consideration. The need for a more
abstract representation that will enable a uniform and more flexible document matching
process, imposes the use of semantic web structures, such as ontologies1 [BHS02,
HN+03]. By mapping the keywords to the concepts of an ontology, or topic hierarchy,
the problem of binary matching can be overcome through the use of the hierarchical
relationships and/or the semantic similarities among the ontology terms, and therefore,
the documents.

1 In this work we focus on the hierarchical part of an ontology. Therefore, in the rest of this work we use the terms concept hierarchy, taxonomy and ontology interchangeably.
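One widely used hierarchy-based measure of this kind is the Wu & Palmer similarity, which scores two terms by the depth of their lowest common ancestor. The sketch below is a minimal illustration over a hypothetical toy taxonomy (the parent map and term names are assumptions made here, not part of this thesis); the measures actually adopted in this work are presented in Section 3.3.

```python
# Minimal sketch of a Wu & Palmer-style similarity over a concept hierarchy,
# given as a child -> parent map. The toy taxonomy below is hypothetical.
# sim(a, b) = 2 * depth(lca(a, b)) / (depth(a) + depth(b)), in [0, 1].

parent = {
    "sports": "root", "winter_sports": "sports", "ski": "winter_sports",
    "travel": "root", "resort": "travel", "hotel": "travel",
}

def ancestors(term):
    """Path from a term up to the root, inclusive."""
    path = [term]
    while term != "root":
        term = parent.get(term, "root")
        path.append(term)
    return path

def depth(term):
    return len(ancestors(term)) - 1  # the root has depth 0

def wup_similarity(a, b):
    """Twice the depth of the deepest common ancestor, normalized by
    the summed depths of the two terms."""
    anc_a = set(ancestors(a))
    lca = next(t for t in ancestors(b) if t in anc_a)
    total = depth(a) + depth(b)
    return 2 * depth(lca) / total if total else 1.0

print(wup_similarity("ski", "winter_sports"))  # closely related terms: 0.8
print(wup_similarity("ski", "hotel"))          # only share the root: 0.0
```

Because the score depends on positions in the hierarchy rather than on exact keyword matches, two documents annotated with different but related terms can still be recognized as similar.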
Finally, we should take into consideration that the web is not just a collection of
documents browsed by its users. The web is a directed labeled graph, including a plethora
of hyperlinks that interconnect its web pages. Both the structural characteristics of the
web graph, as well as the web pages' and hyperlinks' underlying semantics, are important
and determinative factors in the users' navigational process. We briefly discuss the most
important research studies2 based on the aforementioned intuitions below, while a more
detailed overview of related work is given in Chapter 2.
Several research studies proposed frameworks that express the users' navigational
behavior in terms of an ontology and integrate this knowledge in semantic web sites
[OB+03], Markov model-based recommendation systems [AG03], or collaborative
filtering systems [DM02]. Overall, all the aforementioned approaches are based on the
same intuition: enhance the web personalization process with content semantics,
expressed using the terms of a domain-ontology. The extracted web content features are
mapped to ontology terms and this abstraction enables the generalizations/specializations
of the derived patterns and/or user profiles. In all proposed models, however, the
ontology-term mapping process is performed manually or semi-automatically (requiring
manual labeling of the training data set). As far as the content characterization
process is concerned, the features characterizing the web content are extracted from the
web page itself, ignoring semantics arising from the connectivity features of the web
[BP98, CD+98]. Some approaches are based on collaborative filtering systems, which
assume that some kind of user ratings are available, or on semantic web sites, which
assume that an existing underlying semantic annotation of the web content is available a
priori. Finally, none of the aforementioned approaches fully exploits the underlying
semantic similarities of terms belonging to an ontology, apart from the straightforward
is-a or parent-child hierarchical relationships.
As far as the exploitation of the connectivity features of the web graph is concerned,
even though they have been extensively used for personalizing web search results
[ANM04, H02, RD02, WC+02], only a few approaches exist for enhancing the web
recommendation process, either using the degree of link connectivity for switching
among different recommendation models [NM03] or using citation network analysis for
clustering related pages in a recommendation system based on Markov models
[ZHH02b]. None of the aforementioned systems, however, exploits the notion of a web
page's importance in the web graph and fully integrates link analysis techniques in the
web personalization process.

2 At this point, we focus on research studies that appeared prior, or in parallel, to our work.

1.2 Contributions
The main contribution of this thesis is a set of novel techniques and algorithms aimed
at improving the overall effectiveness of the web personalization process through the
integration of the content and the structure of the web site with the users' navigational
patterns.
In the first part of our work we present the semantic web personalization system
SEWeP that integrates usage data with content semantics in order to compute
semantically enhanced navigational patterns and effectively generate useful
recommendations. Similar to previously proposed approaches, the proposed
personalization framework uses ontology terms to annotate the web content and the
users' navigational patterns. The key departure from earlier approaches, however, is that
SEWeP is the only web personalization framework that employs automated
keyword-to-ontology mapping techniques, while exploiting the underlying semantic
similarities between ontology terms. Apart from the novel recommendation algorithms
we propose, we also emphasize a hybrid structure-enhanced method for annotating web
content. To the best of our knowledge, SEWeP is the only semantic web personalization
system that can be used by any web site, given only its web usage logs and a
domain-specific ontology.
Our key contributions regarding this framework are:

A methodology for semantically annotating the content of a web site using
ontology terms. The feature extraction is performed using an integration of
various techniques used in Information Retrieval that exploit both the content and
the connectivity features of the web pages. The mapping of these features to
ontology terms is a fully automated process, using appropriate similarity metrics
and a thesaurus. This characterization enables further processing (clustering,

association rules mining etc.) relying on the semantic similarity between web
documents.

An algorithm for processing multilingual content. All web documents, without
regard to the language they are written in, should be characterized by a set of
terms belonging to a domain-ontology. Therefore, prior to the ontology mapping,
an intermediate step is needed where all the keywords that characterize a
document are translated to a common language. We propose an automated
keyword translation algorithm based on the document's context.

Two recommendation algorithms which integrate web content semantics with the
users' navigational behavior. The web pages are characterized by a set of domain-ontology terms. This uniform characterization enables the categorization of the
web pages into semantically coherent clusters, as well as the semantic
enhancement of the web logs. These two enhanced sources of knowledge are then
used by the proposed methods to generate recommendations that are semantically
relevant to the current navigational behavior of each user. The first method
generates recommendations by expanding the association rules derived by mining
the web logs, using the most similar document cluster. The second method
generates a new type of association rules, named category-based association rules,
which are computed by mining the semantically enhanced logs (called C-logs)
and expanding the recommendation set based on the most similar document
cluster.

An extensive set of user-based experiments (blind tests) which demonstrate the
effectiveness of the proposed methods, when integrated in a web personalization
framework, and support our initial intuition that content semantics enhance the
web personalization process.

In the second part of our work, we build on the notion of authority transfer, as
defined in the most popular link analysis algorithm, PageRank [BP98]. The underlying
assumption is that a web page is considered important (in other words, it is an
authority) if other important pages have a link pointing to it. Authority pages thus
transfer some of their importance to the pages they link to, and so on. Motivated by the

fact that in the context of navigating a web site, a page/path is important if many users
have visited/followed it before, we propose a novel algorithm, named UPR, that assigns
importance rankings (and therefore visit probabilities) to the web site's pages. UPR
(Usage-based PageRank) is a PageRank-style algorithm that is applied on an abstraction
of the user sessions termed the Navigational Graph (NG). Based on this generalized
personalization framework, we specialize it in two different contexts. We develop l-UPR,
a recommendation algorithm based on a localized variant of UPR that is applied to the
personalized navigational sub-graph of each user for providing fast, online
recommendations. Moreover, we integrate UPR and its variations in a hybrid
probabilistic predictive model (h-PPM) as a robust mechanism for determining prior
probabilities of page visits. To the best of our knowledge, this is the first integrated
solution addressing the problem of web personalization using a page ranking approach.
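To convey the intuition before the formal treatment in Chapter 4, the following is a minimal sketch of a PageRank-style iteration in which both the transition weights and the prior (teleportation) vector are derived from usage counts. All page names and counts are hypothetical toy data, and this is not the exact UPR formulation.

```python
# Sketch: authority transfer on a usage-weighted navigational graph.
# Toy data only; the precise UPR definition appears in Chapter 4.

# transitions[u][v] = number of recorded sessions that moved from u to v
transitions = {
    "ski.html": {"ski_resorts.html": 40, "ski_boots.html": 10},
    "ski_resorts.html": {"ski_boots.html": 30, "snowreport.html": 20},
    "ski_boots.html": {},
    "snowreport.html": {"ski_resorts.html": 5},
}
# sessions that started at each page, used here as a usage-based prior
starts = {"ski.html": 50, "ski_resorts.html": 20,
          "ski_boots.html": 5, "snowreport.html": 25}

def usage_rank(transitions, starts, d=0.85, iters=50):
    pages = list(transitions)
    total = sum(starts.values())
    prior = {p: starts[p] / total for p in pages}
    rank = dict(prior)
    for _ in range(iters):
        new = {p: (1 - d) * prior[p] for p in pages}
        for u in pages:
            out = sum(transitions[u].values())
            if out == 0:
                # dangling page: redistribute its rank according to the prior
                for p in pages:
                    new[p] += d * rank[u] * prior[p]
            else:
                for v, w in transitions[u].items():
                    new[v] += d * rank[u] * w / out  # usage-weighted transfer
        rank = new
    return rank

for page, score in sorted(usage_rank(transitions, starts).items(),
                          key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

Pages that many sessions reach, directly or through other frequently visited pages, accumulate rank even if they have few incoming hyperlinks, which is exactly the usage-driven notion of importance motivated above.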
More specifically, our key contributions are:

A unified personalization framework integrating web usage mining with link


analysis techniques for assigning probabilities to the web pages based on their
importance in the web site's navigational graph. We define UPR, a usage-based
personalized PageRank-style algorithm used for ranking the web pages of a site
based on previous users' navigational behavior.

The introduction of l-UPR, a localized version of UPR which is applied to
personalized sub-graphs of the web navigational graph in order to provide fast,
online rankings of probable next pages of interest to current users. We describe
how these personalized sub-graphs are generated online, based on the current visit
of each user.

The application of UPR for extending and enhancing standard web usage mining
and personalization probabilistic models such as Markov models. We present a
hybrid probabilistic prediction framework (h-PPM) where UPR, as well as its
variations, are used for assigning prior probabilities to the nodes (pages) of any
Markov model based on the topology (structure) and the navigational patterns
(usage) of the web site; a minimal sketch of this idea follows this list.

An extensive set of experiments demonstrating UPR's effectiveness in both proposed
frameworks. We apply UPR and its variations for assigning priors to be used by
different order Markov models and show that the recommendation accuracy is
better than pure-usage based approaches. Moreover, we apply l-UPR to localized
sub-graph synopses for generating online recommendations, and again support our
claim that the prediction process should be enhanced with information based on
the link structure in combination with the usage of a site.
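As announced in the third contribution above, the following sketch illustrates how importance-derived priors can be blended into a first-order Markov next-page predictor. The interpolation scheme and all data are assumptions made purely for illustration; the actual h-PPM formulation, in which UPR scores serve as prior probabilities, is defined in Chapter 4.

```python
# Sketch: first-order Markov prediction smoothed with rank-derived priors.
# Hypothetical toy data; see Chapter 4 for the actual h-PPM model.

transitions = {  # observed next-page counts per current page
    "ski.html": {"ski_resorts.html": 8, "snowreport.html": 2},
    "ski_resorts.html": {"ski_boots.html": 6, "snowreport.html": 4},
}
# normalized importance scores (e.g. produced by a UPR-style ranking)
priors = {"ski_resorts.html": 0.40, "ski_boots.html": 0.35,
          "snowreport.html": 0.25}

def next_page_scores(current, alpha=0.7):
    """Interpolate the empirical transition probabilities with the prior;
    alpha weighs the observed counts against the importance prior."""
    counts = transitions.get(current, {})
    total = sum(counts.values())
    scores = {}
    for page, prior in priors.items():
        empirical = counts.get(page, 0) / total if total else 0.0
        scores[page] = alpha * empirical + (1 - alpha) * prior
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

print(next_page_scores("ski_resorts.html"))
# pages never observed after ski_resorts.html still get a prior-based score
```

The point of the blend is robustness: for sparse or unseen transitions the prediction falls back on the structural and usage-based importance of each page instead of on zero counts.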

1.3 Thesis Outline


The rest of this thesis is organized as follows. Chapter 2 presents a brief introduction
to several data preprocessing issues that should be addressed prior to applying any web
personalization techniques. It also reviews related work on the web usage mining and
personalization areas, with emphasis on web personalization methods that integrate content
or structure. All related efforts, including our work, are summarized in Table 1, which is
included at the end of this Chapter. In Chapter 3 we present in detail the proposed
semantic web personalization framework. In Chapter 4 we demonstrate how link analysis
can be integrated in the web personalization process. Finally, Chapter 5 concludes this
Thesis and outlines directions for future work.


CHAPTER 2
Preliminaries & Related Work
In this Chapter we start by briefly presenting the data preprocessing issues that should
be taken into consideration prior to applying any web mining and personalization
techniques to the usage data. We then provide a review of related research efforts,
ranging from the earlier approaches that focus on web usage mining, to the ones focusing
on web personalization. We then present those that integrate content and/or structure data
in the web personalization process, with emphasis on the research efforts (previous and
subsequent) that are most similar to our work3. We provide a summarized overview of all
related research efforts categorized by the web mining method employed and their
application area in Table 1. The areas covered by our work are depicted by highlighted
cells.

2.1 Usage Data Pre-processing


The main data source in the web usage mining and personalization process is the
information residing on the web site's logs. Web logs record every visit to a page of the
web server hosting it. The entries of a web log file consist of several fields which
represent the date and the time of the request, the IP number of the visitor's computer
(client), the URI requested, the HTTP status code returned to the client, and so on. The
web log file format is based on the so-called extended log format, proposed by the W3C
[W3Clog]. In general, the extended log format consists of a list of prefixes, such as c
(client), s (server), r (remote), cs (client to server), sc (server to client), sr (server to
remote server, used by proxies), rs (remote server to server, used by proxies), and x
(application-specific identifier), and a list of
identifiers such as date, time, ip (records the IP of the client generating the page hit),
bytes (records the number of bytes transferred), cached (records whether a cache hit
occurred), status (records the status code returned by the web server), comment (comment
returned with status code), method (method used to retrieve data), uri (the URI
requested), uri-stem and uri-query. Using a combination of some of the aforementioned
prefixes and identifiers, additional information such as referrer, that is the web page the
client was visiting before requesting that page, user_agent, that is the software the client
is using, or keyword, that is the keywords used when visiting that page after a search
engine query, can be recorded. Apart from the web server logs, which are the main source
of information in the web usage mining and personalization processes, useful information
can be acquired from proxy server logs, browser logs, registration data, cookies, user
ratings etc. Since in this thesis we present a generic personalization framework which can
be applied on any web site, requiring only the anonymous usage data recorded in its web
usage logs, we do not elaborate on such data sources.
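As an illustration of the fields discussed above, the sketch below parses one hypothetical extended-log-format entry. The field order and the sample line are assumptions made for brevity; a real log declares its own field order in a #Fields directive.

```python
# Sketch: reading one (hypothetical) W3C extended-log-format entry.
# Real logs declare the field order in a "#Fields:" directive; here it is
# fixed for brevity.
import shlex

FIELDS = ["date", "time", "c-ip", "cs-method", "cs-uri", "sc-status",
          "cs(Referer)", "cs(User-Agent)"]

line = ('2006-05-10 14:32:07 192.168.1.17 GET /travel/ski_resorts.html 200 '
        '"/events/ski.html" "Mozilla/4.0"')

def parse_entry(line):
    """Split a log line into a field dictionary, keeping quoted values
    (referrer, user agent) intact."""
    return dict(zip(FIELDS, shlex.split(line)))

entry = parse_entry(line)
print(entry["c-ip"], entry["cs-uri"], entry["cs(Referer)"])
```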
Prior to processing the usage data using web mining or personalization algorithms,
the information residing in the web logs should be preprocessed. The web log data preprocessing is an essential phase in the web usage mining and personalization process. An
extensive description of this process can be found in [CMS99]. In the sequel, we provide
a brief overview of the most important pre-processing techniques, providing in parallel
the related terminology.
The first issue in the pre-processing phase is data preparation. Depending on the
application, the web log data may need to be cleaned of entries involving page accesses
that returned an error, for example, or of graphics file accesses. Furthermore,
crawler activity usually should be filtered out, because such entries do not provide useful
information about the site's usability. A very common problem to be dealt with has to do
with web page caching. When a web client accesses an already cached page, this access
is not recorded in the web site's log. Therefore, important information concerning web
path visits is missed. Caching is heavily dependent on the client-side technologies used
and therefore cannot be dealt with easily. In such cases, cached pages can usually be
inferred using the referring information from the logs and certain heuristics, in order to
re-construct the user paths, filling in the missing pages.
After all page accesses are identified, the pageview identification should be
performed. According to [WCA] a pageview is defined as the visual rendering of a web
page in a specific environment at a specific point in time. In other words, a pageview
consists of several items, such as frames, text, graphics and scripts that construct a single
web page. Therefore, the pageview identification process involves the determination of
the distinct log file accesses that contribute to a single pageview. Again, such a decision is
application-oriented.
In order to personalize a web site, the system should be able to distinguish between
different users or groups of users. This process is called user profiling. In case no other
information than what is recorded in the web logs is available, this process results in the
creation of aggregate, anonymous user profiles, since it is not feasible to distinguish
among individual visitors. However, if user registration is required by the web site,
the information residing in the web log data can be combined with the users'
demographic data, as well as with their individual ratings or purchases. The final stage of
log data pre-processing is the partition of the web log into distinct user and server
sessions. A user session is defined as a delimited set of user clicks across one or more
web servers, whereas a server session, also called a visit, is defined as a collection of
user clicks to a single web server during a user session [WCA]. If no other means of
session identification, such as cookies or session ids is used, session identification is
performed using time heuristics, such as setting a minimum timeout and assuming that
consecutive accesses within it belong to the same session, or setting a maximum timeout
and assuming that two consecutive accesses whose time gap exceeds it belong to different sessions. More
details on the user and session identification process can be found in [EV03].
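The maximum-timeout heuristic can be sketched in a few lines. The code below assumes requests have already been grouped per anonymous visitor and time-ordered; the 30-minute threshold is a common choice in the literature, not a value prescribed by this thesis.

```python
# Sketch: time-based session identification. Consecutive requests from the
# same visitor that are more than `timeout` seconds apart start a new session.

def sessionize(requests, timeout=30 * 60):
    """Split one visitor's (timestamp, uri) requests into sessions."""
    sessions, current, last_ts = [], [], None
    for ts, uri in sorted(requests):
        if last_ts is not None and ts - last_ts > timeout:
            sessions.append(current)  # gap exceeded: close current session
            current = []
        current.append(uri)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions

visits = [(1000, "/events/ski.html"), (1300, "/travel/ski_resorts.html"),
          (9000, "/weather/snowreport.html")]  # 3rd hit exceeds the timeout
print(sessionize(visits))
# [['/events/ski.html', '/travel/ski_resorts.html'], ['/weather/snowreport.html']]
```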


2.2 Web Usage Mining and Personalization


Web usage mining is the process of identifying representative trends and browsing
patterns describing the activity in the web site, by analyzing the users' behavior. Web
site administrators can then use this information to redesign or customize the web site
according to the interests and behavior of its visitors, or improve the performance of their
systems. Moreover, the managers of e-commerce sites can acquire valuable business
intelligence, creating consumer profiles and achieving market segmentation.
There exist various methods for analyzing the web log data [CMS97, SC+00]. Some
research studies use well known data mining techniques such as association rules
discovery [CPY96], sequential pattern analysis [B02, BB+99, BS00, SFW99], clustering
[KJ+01, NC+03, YZ+96], probabilistic models [BL99, DK04, JZM04a, LL03, S00,
ZB04, ZHH02a], or a combination of them [CH+00, YH03]. Since web usage mining
analysis was initially strongly correlated to data warehousing, there also exist some
research studies based on OLAP cube models [HN+01, ZXH98]. Finally, some proposed
web usage mining approaches require registered user profiles [HF04, SZ+97], or
combine the usage data with semantic meta-tags incorporated in the web site's content
[ML+04, OB+03].
Furthermore, this knowledge can be used to automatically or semi-automatically
adjust the content of the site to the needs of specific groups of users, i.e. to personalize
the site. As already mentioned, web personalization may include the provision of
recommendations to the users, the creation of new index pages, or the generation of
targeted advertisements or product promotions. Usage-based personalization systems
use association rules and sequential pattern discovery [MPT99], clustering [AP+04,
BS04], Markov models [ADW02], machine learning algorithms [HEK03, NP04], or are
based on collaborative filtering [MD+00a, NM02, NP03, SK+00, SK+01] in order to
generate recommendations. Some research studies also combine two or more of the
aforementioned techniques [JF+97, KS04, MPG03].


2.3 Integrating Content Semantics in Web Personalization


Several frameworks supporting the claim that the incorporation of information related
to the web site's content enhances the web personalization process have been proposed
prior [MD+00b, EGP02, PE00] or subsequent [JZM04b, JZM05, GKG05] to our work. In
this Section we overview in detail the ones that are most similar to ours, in terms of using
a domain-ontology to represent the web site's content.
Dai and Mobasher [DM02] proposed a web personalization framework that uses
ontologies to characterize the usage profiles used by a collaborative filtering system.
These profiles are transformed to domain-level aggregate profiles by representing each
page with a set of related ontology objects. In this work, the mapping of content features
to ontology terms is assumed to be performed either manually, or using supervised
learning methods. The defined ontology includes classes and their instances; therefore, the
aggregation is performed by grouping together different instances that belong to the same
class. The recommendations generated by the proposed collaborative system are in turn
derived by binary matching of the current user visit, expressed as ontology instances, to
the derived domain-level aggregate profiles, and no semantic similarity measure is used.
The idea of semantically enhancing the web logs using ontology concepts is
independently described by Oberle et al. [OB+03]. This framework is based on a
semantic web site built on an underlying ontology. The authors present a general
framework where data mining can then be performed on these semantic web logs to
extract knowledge about groups of users, users' preferences, and rules. Since the
proposed framework is built on a semantic web knowledge portal, the web content is
already semantically annotated (through the existing RDF annotations), and no further
automation is provided. Moreover, the proposed framework focuses solely on web
mining and thus does not perform any further processing in order to support web
personalization.
Acharyya and Ghosh [AG03] also propose a general personalization framework based
on the conceptual modeling of the users' navigational behavior. The proposed
methodology involves mapping each visited page to a topic or concept, imposing a
concept hierarchy (taxonomy) on these topics, and then estimating the parameters of a
semi-Markov process defined on this tree based on the observed user paths. In this
Markov model-based work, the semantic characterization of the content is performed
manually. Moreover, no semantic similarity measure is exploited for enhancing the
prediction process, except for generalizations/specializations of the ontology terms.
Finally, in a subsequent work, Middleton et al. [MSR04] explore the use of ontologies
in the user profiling process within collaborative filtering systems. This work focuses on
recommending academic research papers to academic staff of a University. The authors
represent the acquired user profiles using terms of a research paper ontology (is-a
hierarchy). Research papers are also classified using ontological classes. In this hybrid
recommender system which is based on collaborative and content-based recommendation
techniques, the content is characterized with ontology terms, using document classifiers
(therefore a manual labeling of the training set is needed) and the ontology is again used
for making generalizations/specializations of the user profiles.

2.4 Integrating Structure in Web Personalization


Although the connectivity features of the web graph have been extensively used for
personalizing web search results [ANM04, H02, RD02, WC+02], only a few approaches
exist that take them into consideration in the web site personalization process. Zhu et al.
[ZHH02b] use citation and coupling network analysis techniques in order to conceptually
cluster the pages of a web site. The proposed recommendation system is based on
Markov models. Nakagawa and Mobasher [NM03] use the degree of connectivity
between the pages of a web site as the determinant factor for switching among
recommendation models based on either frequent itemset mining or sequential pattern
discovery. Nevertheless, none of the aforementioned approaches fully integrates link
analysis techniques in the web personalization process by exploiting the notion of the
authority or importance of a web page in the web graph.
In a very recent work, Huang et al. [HLC05] address the data sparsity problem of
collaborative filtering systems by creating a bipartite graph and calculating linkage
measures between unconnected pairs in order to select candidates and make
recommendations. In this study, the graph nodes represent both users and rated/purchased
items.
Finally, subsequent to our work, Borges and Levene [BL06] independently proposed
two link analysis ranking methods, SiteRank and PopularityRank, which are in essence
very much like the proposed variations of our UPR algorithm (PR and SUPR
respectively). This work focuses on the comparison of the distributions and the rankings
of the two methods rather than proposing a web personalization algorithm. The authors'
concluding remarks, that the topology of the web site is very important and should be
taken into consideration in the web personalization process, further support our claim.


Table 1: Related Work

| METHOD \ AREA | WUM | WUM & Profile | WUM & Content | WP | WP & Profile | WP & Content | WP & Structure |
|---|---|---|---|---|---|---|---|
| General framework / architecture | [CMS97, SC+00] | | [ML+04, OB+03*] | | | [PE00] | |
| Association Rules Mining (AR) | [CPY96] | | | | | | |
| Sequential Pattern Discovery (SP) | [BB+99, BS00, SFW99] | | | | | | |
| Clustering (Cl) | [KJ+01, NC+03, YZ+96] | [HF04] | | [AP+04, BS04] | [SZ+97] | | |
| OLAP cube based model | [HN+01, ZXH98] | | | | | | |
| Collaborative Filtering (CF) | N/A | | | [MD+00a, NM02, NP03, SK+00, SK+01] | | [MD+00b, MSR04*, DM02*] | [HLC05] |
| Markov models (MM) | [BL99, DK04, LL03, S00, ZHH02a] | | | [ADW02] | | [AG03*] | |
| Probabilistic models (PM) | [ZB04, JZM04a] | | | | | [JZM04b] | |
| Machine Learning methods (ML) | | | | [HEK03, NP04] | | [EGP02] | |
| Link Analysis algorithms (LA) | N/A | N/A | N/A | | | | [BL06] † |
| MM & Cl | [CH+00, YH03] | | | [MPG03] | | | [ZHH02b] |
| AR & SP | | | | [MPT99] | | | [NM03] |
| AR & Cl | N/A | N/A | N/A | N/A | | [GKG05] † | |
| ML & CF | N/A | N/A | N/A | [JF+97] | | | |
| MM & CF | N/A | N/A | N/A | [KS04] | | | |
| PM & CF | N/A | N/A | N/A | N/A | | [JZM05] | |
| LA & MM | N/A | N/A | N/A | N/A | N/A | N/A | † |

*: Systems/Approaches that use ontologies
N/A: Not applicable method for this area
†: Areas covered by our work
(Rows MM & Cl through LA & MM are method combinations.)


CHAPTER 3
Semantic Web Personalization
Users' navigation in a web site is typically content-driven. Users usually
search for information or services concerning a particular topic. Therefore, the underlying
content semantics should be a dominant factor in the process of web personalization. In
this thesis we present SEWeP (standing for Semantic Enhancement for Web
Personalization), a web personalization framework that integrates content semantics with
the users' navigational patterns, using ontologies to represent both the content and the
usage of the web site.
In our proposed framework we employ web content mining techniques to derive
semantics from the web site's pages. These semantics, expressed in ontology terms, are
used to create semantically enhanced web logs, called C-logs (concept logs).
Additionally, the site is organized into thematic document clusters. The C-logs and the
document clusters are in turn used as input to the web mining process, resulting in the
creation of a broader, semantically enhanced set of recommendations. The whole process
bridges the gap between Semantic Web and Web Personalization areas, to create a
Semantic Web Personalization system. To the best of our knowledge, SEWeP is the only
system that provides an integrated solution for semantic web personalization and can be
used by any (semantic or not) web site, fully exploiting the underlying semantic
similarities of ontology terms. Parts of this chapter have appeared in [EVV03, EL+04].


In the Sections that follow we motivate the integration of content semantics in the
web personalization process using an illustrative example, and then present in more detail
the components of the SEWeP system. We conclude with an extensive experimental
evaluation of the system, as well as a brief description of system prototypes based (or
partly based) on the SEWeP framework.

3.1 Motivating Example


A web site can be personalized in various ways, such as the creation of new index
pages, personalized search services, or dynamic recommendations generation. In this
thesis, we deal with the latter case, that of predicting links that might be of interest for
each specific visitor.
Resuming from the example presented in the first Chapter, we assume that there
exists a web portal, specializing on sports activities, called (hypothetically) Sportal, and
residing on the imaginary site www.theSportal.com. This portal contains various
information resources about sports, ranging from sport news to advertisements on sport
events and equipment. The personalization system of Sportal applies association rules
mining on its web logs in order to generate recommendations to its visitors, based on the
assumption that users with similar interests have similar navigational behavior. Assume
that one of the discovered patterns is the following:
www.theSportal.com/events/ski.html,
www.theSportal.com/travel/ski_resorts.html
→ www.theSportal.com/equipment/ski_boots.html.
One may easily interpret this pattern as: people that are interested in ski events and
search for winter vacations will probably be interested in purchasing ski boots. Based on
the assumption that this user is interested in finding a ski resort to spend her holidays and
using pure usage-based personalization, the next time a user U navigates through
Sportal and visits the first two web pages, the personalized site will dynamically
recommend to U the page included in the right hand side (RHS) of the rule.


The Sportal's content, however, is continuously updated. Suppose that the ski
equipment department has just announced a sale on all ski boots:
www.theSportal.com/equipment/ski_boot_sale.html.
Since this is a new web page, it isn't included in the web logs, or appears with very low
frequency (only a few users, if any, have visited it), and therefore it is definitely not
included in the derived association rules comprising our navigational model. As a
consequence, if we follow the traditional usage-based personalization process, it will
never be recommended to U, even though it is apparent that it is very similar to U's
search intentions.
Moreover, assume that Sportal also hosts another service, about the snow
conditions in several ski resorts, in the web page:
www.theSportal.com/weather/snowreport.html.
Again, the information residing in this page is very relevant to U's interests, but it is
not included in the association rules derived. This may occur, for example, if the web
administrator hasn't added a link from the ski-related pages to the weather page, and therefore
not many users have followed this path before.
As a third scenario, consider the case when U, instead of following the previous path,
visits the web pages
www.theSportal.com/sports/winter_sports/ski.html,
www.theSportal.com/travel/winter/hotels.html.
It is obvious that this visit is semantically similar to the previous one and that the objective
of the user is the same. The system, however, will not provide the same recommendations
to U, since it won't recognize this similarity. Moreover, in case these two web pages are not
included in an association rule in the knowledge base, for any of the aforementioned
reasons, the system will recommend nothing at all!
Based on the aforementioned example, it is evident that pure usage-based
personalization is problematic in several cases. We claim that information conceptually
related to the user's visit should not be missed, and introduce the SEWeP
personalization system that addresses the aforementioned shortcomings by generating
semantically enhanced recommendations.


Before getting into more technical details, we demonstrate SEWeP's functionality
through a use-case scenario based on the aforementioned example. The users' visits to the
web pages of Sportal are recorded in the web server logs. These logs, after being
preprocessed, are used along with the web site's semantics as input in the process of
C-Logs creation. In other words, the content of the web site is processed in order to be
semantically characterized by the categories of a related ontology. Table 2 shows some
URIs along with the respective domain-ontology terms that semantically characterize
them.
Table 2. URIs and related concept hierarchy terms

| URI | Concept hierarchy terms |
|---|---|
| www.theSportal.com/events/ski.html | sport, event, ski, schedule |
| www.theSportal.com/sports/winter_sports/ski.html | sport, winter, ski |
| www.theSportal.com/travel/ski_resorts.html | travel, ski, resort |
| www.theSportal.com/travel/winter/hotels.html | travel, winter, hotel, sport |
| www.theSportal.com/equipment/ski_boots.html | snow, ski, equipment |
| www.theSportal.com/equipment/ski_boot_sale.html | snow, ski, equipment |
| www.theSportal.com/weather/snowreport.html | weather, snow, ski |

Based on the semantic similarity between these terms, the respective web pages are
categorized in semantic clusters (since the terms are hierarchically correlated). SEWeP's
recommendation engine generates both URI-based (as any usage-based personalization
system does) and category-based association rules (e.g. snow, winter, hotel → travel,
equipment). These rules are then expanded to include documents that fall under the most
similar semantic cluster.
Returning to our scenario, assume that the user visits the web pages:
www.theSportal.com/events/ski.html, and
www.theSportal.com/travel/ski_resorts.html.


The system, based on the URI-based association rules derived from web log mining,
finds the most relevant rule and recommends its RHS to her. This recommendation set
will be referred to as original recommendations:
www.theSportal.com/events/ski.html, www.theSportal.com/travel/ski_resorts.html
→ www.theSportal.com/equipment/ski_boots.html.
Moreover, it expands the recommendation set by including documents that belong to the
same thematic cluster as the proposed URI, generating semantic recommendations:
www.theSportal.com/equipment/ski_boot_sale.html,
www.theSportal.com/weather/snowreport.html.
Assume now that another user navigates through the web site, visiting the web pages
www.theSportal.com/sports/winter_sports/ski.html,
www.theSportal.com/travel/winter/hotels.html.
Based on the derived URI-based association rules, a usage-based personalization system
would not find a matching association rule and wouldn't recommend anything. SEWeP,
however, based on the category-based association rules it generates, abstracts the user's
visit and matches it with the category-based rule:
ski, winter, travel → snow, equipment
It then recommends documents that belong to the cluster which is characterized by the
RHS terms. This recommendation set will be referred to as category-based
recommendations. In what follows, we describe in detail how SEWeP implements the
aforementioned process.

3.2 SEWeP System Architecture


SEWeP uses a combination of web mining techniques to personalize a web site. In
short, the web sites content is processed and characterized by a set of ontology terms
(categories). The visitors navigational behavior is also updated with this semantic

24

knowledge to create an enhanced version of web logs, C-logs, as well as semantic


document clusters. C-Logs are in turn mined to generate both a set of URI and categorybased association rules. Finally, the recommendation engine uses these rules, along with
the semantic document clusters in order to provide the final, semantically enhanced set of
recommendations to the end user.

Figure 2. SEWeP architecture


As illustrated in Figure 2, SEWeP consists of the following components:
• Content Characterization. This module takes as input the content of the web site
as well as a domain-specific ontology, and outputs the semantically annotated
content to the modules that are responsible for creating the C-Logs and the
semantic document clusters. The content characterization process consists of the
keyword extraction, keyword translation and semantic characterization
sub-processes, which are described in more detail in Section 3.4.

Semantic Document Clustering. The semantically annotated pages created by the


previous component are grouped into thematic clusters. This categorization is
achieved by clustering the web documents based on the semantic similarity
between the ontology terms that characterize them. This process is described in
Section 3.5.

C-Logs Creation & Mining. This module takes as input the web sites logs as well
as the semantically annotated web site content. It outputs the semantically

25

enhanced C-logs (concept logs) which are in turn used to generate both URI and
category-based frequent itemsets and association rules. These rules are
subsequently matched to the current users visit by the recommendation engine.
We overview this process in Section 3.6.

Recommendation Engine. This module takes as input the current users path and
matches it with the semantically annotated navigational patterns generated in the
previous phases. The recommendation engine generates three different
recommendation sets, namely, original, semantic and category-based ones,
depending on the input patterns used. In Section 3.7 we overview the two novel
recommendation algorithms that are employed by SEWeP.

The creation of the ontology as well as the semantic similarity measures used as input
in the aforementioned web personalization process are orthogonal to the proposed
framework. We assume that the ontology is descriptive of the web site's domain and is
provided/created by a domain expert. In what follows we describe the key components of
our architecture, starting by introducing the similarity measures we used in our work.

3.3 Similarity of Ontology Terms


As already mentioned, SEWeP exploits the expressive power of content semantics,
represented by ontology terms. Using such a representation, the similarity between
documents is reduced to the distance between terms that are part of a hierarchy. The need
for such a similarity measure is encountered throughout the personalization process,
namely, during content characterization, keyword translation, document clustering and
recommendation generation. In our approach, we adopt the Wu & Palmer similarity
measure [WP94] for calculating the distance between terms that belong to a tree
(hierarchy). Moreover, we use its generalization, proposed by Halkidi et al. [HN+03], to
compute the similarity between sets of terms that belong to a concept hierarchy.
Henceforth, we will refer to these metrics as WPsim and THEsim respectively. We
should stress that the choice of the similarity measure is orthogonal to the rest of the
system's functionality, as long as it serves for calculating the distance between
hierarchically organized terms [EM+06, MT+05]. The definitions of the two similarity
measures are given in what follows.
3.3.1 Wu & Palmer Similarity Measure
Given a tree and two nodes a, b of this tree, their similarity is computed as follows:

$$WPsim(a,b) = \frac{2 \cdot depth(c)}{depth(a) + depth(b)} \quad (1)$$

where the node c is their deepest (in terms of tree depth) common ancestor.
3.3.2 THESUS Similarity Measure
Given a concept hierarchy O and two sets of weighted terms A = {(w_i, k_i)} and
B = {(v_i, h_i)}, with k_i, h_i ∈ O, their similarity is defined as:

$$THEsim(A,B) = \frac{1}{2}\left[\frac{1}{K}\sum_{i=1}^{|A|}\max_{j\in[1,|B|]}\left(\lambda_{i,j}\,WPsim(k_i,h_j)\right) + \frac{1}{H}\sum_{i=1}^{|B|}\max_{j\in[1,|A|]}\left(\lambda_{i,j}\,WPsim(h_i,k_j)\right)\right] \quad (2)$$

where $\lambda_{i,j} = \frac{w_i+v_j}{2\max(w_i,v_j)}$, $K = \sum_{i=1}^{|A|}\lambda_{i,x(i)}$ with
$x(i) = x \mid \lambda_{i,x}\,WPsim(k_i,h_x) = \max_{j\in[1,|B|]}\left(\lambda_{i,j}\,WPsim(k_i,h_j)\right)$,
and H is defined symmetrically over the terms of B.
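To make the two measures concrete, the following is a minimal Python sketch over a toy
concept hierarchy; the PARENT map, the term names and the function names are our own
illustrative assumptions, not part of SEWeP.

# Minimal sketch of WPsim (Eq. 1) and THEsim (Eq. 2) over a toy hierarchy.
# The hierarchy, encoded as a child -> parent map, is a made-up example.
PARENT = {"sport": "root", "winter": "root", "travel": "root",
          "ski": "sport", "skating": "sport", "hotel": "travel", "resort": "travel"}

def ancestors(n):
    path = [n]
    while n in PARENT:
        n = PARENT[n]
        path.append(n)
    return path                          # n, parent(n), ..., root

def depth(n):
    return len(ancestors(n))             # root has depth 1

def wp_sim(a, b):
    anc_a = set(ancestors(a))
    c = next(n for n in ancestors(b) if n in anc_a)   # deepest common ancestor
    return 2 * depth(c) / (depth(a) + depth(b))       # Equation (1)

def the_sim(A, B):
    """A, B: lists of (weight, term) pairs; Equation (2)."""
    def one_side(X, Y):
        total, norm = 0.0, 0.0
        for w, k in X:
            # best weighted match of term k against the terms of Y
            lam, s = max((((w + v) / (2 * max(w, v)), wp_sim(k, h)) for v, h in Y),
                         key=lambda p: p[0] * p[1])
            total += lam * s
            norm += lam                  # contributes to K (resp. H)
        return total / norm
    return 0.5 * (one_side(A, B) + one_side(B, A))

print(wp_sim("ski", "skating"))          # 2*2/(3+3) = 0.666...
print(the_sim([(1.0, "ski")], [(0.5, "resort"), (0.8, "skating")]))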

3.4 Content Characterization


A fundamental component of the SEWeP architecture is the automated content
characterization process. SEWeP is the only web personalization framework enabling the
automated annotation of web content with ontology terms without needing any human
labeling or prior training of the system. The keyword extraction is based both on the
content of the web pages and on their connectivity features. What is more, SEWeP
enables the annotation of multilingual content, since it incorporates a context-sensitive
translation component which can be applied prior to the ontology mapping process. In the
subsections that follow we describe in detail the aforementioned processes, namely, the
keyword extraction, keyword translation and semantic characterization modules.


3.4.1 Keyword Extraction


There exists a wealth of methods for representing web documents, most of which
have emerged from the area of searching and querying the web [BP98, CD+99, HG+02].
The most straightforward approach is to perform text mining in the document itself
following standard Information Retrieval (IR) techniques. This approach, however, has
been shown to be insufficient for web content, since it relies solely on the information
included in the document, ignoring semantics arising from the connectivity features of the
web [BP98, CD+98]. It is difficult to extract keywords from web documents that contain
images, programs, etc. Additionally, many web pages do not include the words that are
most descriptive of their content (for example, a portal web site rarely includes the
word "portal" in its home page). Therefore, in many approaches the information contained
in the links that point to the document and the text near them - defined as the anchor-window
[CD+97] - is used for characterizing a web document [CD+99, HG+02, PW00, VV+04].
This approach is based on the hypothesis that the text around the link to a page is
descriptive of the page's contents, and it overcomes the problems of the content-based
approach, since it takes into consideration the way others characterize a specific web
page. In our work, we adopt and extend this approach by also taking into consideration
the content of the pages that are pointed to by the page being processed, based on the
assumption that in most web pages the authors include links to topics that are of
importance in the page's context.
More specifically, the keywords that characterize a web page p are extracted using:
1. the raw term frequency of p
2. the raw term frequency of a selected fraction (anchor-window) of the web pages that
point to p (inlinks)
3. the raw term frequency of the web pages that are pointed to by p (outlinks)
The three keyword extraction methods can be applied interchangeably or in combination,
as illustrated in the sketch below.
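As an illustration, a rough sketch of the combined extraction, assuming tokenized text is
already available; the function and argument names are ours, not SEWeP's API.

from collections import Counter

def extract_keywords(page_terms, anchor_terms, outlink_terms, top_n=7):
    """page_terms: tokens of p; anchor_terms: tokens from the anchor-windows
    of pages pointing to p; outlink_terms: tokens of pages pointed to by p."""
    freq = Counter(page_terms)        # method 1: raw term frequency of p
    freq.update(anchor_terms)         # method 2: inlink anchor-window terms
    freq.update(outlink_terms)        # method 3: outlinked pages' terms
    return freq.most_common(top_n)    # (keyword, summed frequency) pairs

print(extract_keywords(["ski", "resort", "ski"], ["ski", "holiday"], ["snow"]))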
We should explain at this point our decision concerning the term weighting phase, in
which the extracted keywords are assigned weights so that the most important ones can be
selected. Term weighting, extensively used in the vector space model for document
clustering, is carried out using several methods, such as raw term frequency, or algorithms
belonging to the Tf*Idf family [SB98]. Raw term frequency is based on the term statistics
within a document and is the simplest way of assigning weights to terms. Tf*Idf is a
method used for collections of documents, i.e. documents that have similar content. In the
case of a web site, however, this assumption is not always true, since a web site may
contain documents that refer to different thematic categories (especially in the case of web
portals); this is the reason for choosing raw term frequency as the term weighting method
of our approach.
At the end of this phase, each document d is characterized by a weighted set of
keywords d = {(ki, wi)}, where wi is the weight representing the summed (over the
combination of methods) frequency of keyword ki. Before proceeding with mapping the
extracted keywords to related ontology terms, all non-English keywords must be
translated. In our approach, we determine the most suitable synonym using a
context-sensitive automated translation method, which is described in detail in the
section that follows.
3.4.2 Keyword Translation
As already mentioned, the recommendation process is based on the characterization
of all web documents using a common representation. Since many web sites contain
content written in more than one language, this raises the issue of mapping keywords
from different languages to the terms of a common domain-ontology.
Consider, for example, the web site of a Computer Science department or of a
research group in Greece. Such a site will contain information addressed to the students,
which will be written in Greek; research papers, which will be written in English; and
course material, which will be written in both languages. Since the outcome of the
keyword extraction process is a mixed set of English and Greek words, all Greek
keywords must be translated to English prior to selecting the most frequent ones. By
using any dictionary, each Greek word (after stemming and transformation to the
nominative) will be mapped to a set of English synonyms; the most appropriate synonym,
however, depends on the context of the web page's content. A naive approach would be
to keep all possible translations, or a subset of them, but this would result in a high
number of keywords and would lead to inaccurate results. Another, less computationally
intensive, approach would be to keep the first translation returned by the dictionary,
which is the most common one. The first translation, however, is not always the best.
For example, the words "plan", "schedule" and "program" are all translations of the
same Greek word (πρόγραμμα); in the Informatics context, however, the word
"program" is the one that should be selected.
To address this important issue, we propose to determine the most precise synonym
based on the content of the web page it was extracted from. Assuming that the set of
keywords is descriptive of the web page's content, we derive the best synonym set
by comparing their semantics. This context-sensitive automated translation method is
applicable to any language, provided that a dictionary and its inflection rules are
available. In our system implementation we applied it to the Greek language.
Since all words in the Greek language (nouns, verbs, adverbs) can be inflected, we
perform stemming and transformation to the nominative of each Greek word prior to
applying the actual translation method. For this purpose, we used the inflection rules of
Triantafillidis' Grammar [Tria]. The translation algorithm is depicted in Figure 3. The
input is the set of English and Greek keywords (En(D) and Gr(D) respectively) of
document D. The output is a set of English keywords K that best characterize the web
page. Let En(g) = {English translations of g, g ∈ Gr(D)} and Sn(g) = {WordNet senses of
keywords in En(g)}. For every translated word's sense (as defined by WordNet), the
algorithm computes the sum of the maximum similarities between this sense and the
senses of the remaining keywords (let WPsim denote the Wu & Palmer similarity between
two senses). Finally, it selects the English translation with the maximum-scored sense. The
algorithm has complexity O(kn²) for every Greek keyword, where n is the number of
senses per keyword and k is the number of remaining words w. Since this algorithm
is applied off-line, once for every document D, it does not constitute a bottleneck in the
system's online performance.


An initial experimental evaluation [LE+04] has shown promising results for the
proposed approach, but several issues remain open. For instance, our technique makes an
implicit assumption of "one sense per discourse", i.e., that multiple appearances of the
same word will have the same meaning within a document. This assumption might not
hold in several cases, thus leading to erroneous translations. Our technique constitutes a
first step toward the automated mapping of keywords to the terms of a common concept
hierarchy; clearly, a more extensive study is required in order to provide a complete and
more precise solution.
Procedure translateW(Gr, En)
1.  K ← ∅;
2.  for all g ∈ Gr(D) do
3.    for all s ∈ Sn(g) do
4.      score[s] = 0;
5.      for all w ∈ En(D) ∪ Gr(D) − {g} do
6.        sim = max(WPsim(s, Sn(w)));
7.        score[s] += sim;
8.      done
9.    done
10.   smax = s, (score[s] = max(score[s]), s ∈ Sn(g));
11.   K ← e, e ∈ En(g), e contains smax;
12. done

Figure 3. The keyword translation procedure
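A rough modern re-implementation of the sense-scoring idea, using NLTK's WordNet
interface in place of the JWNL library used in the prototype; the candidate translations
are assumed to come from a bilingual dictionary, and we restrict the sketch to nouns.

from nltk.corpus import wordnet as wn   # requires the WordNet corpus to be installed

def best_translation(candidates, context_words):
    """candidates: English translations of one Greek keyword;
    context_words: the remaining (translated) keywords of the document."""
    best, best_score = candidates[0], -1.0
    for cand in candidates:
        for sense in wn.synsets(cand, pos=wn.NOUN):
            score = 0.0
            for w in context_words:
                w_senses = wn.synsets(w, pos=wn.NOUN)
                if w_senses:
                    # max Wu & Palmer similarity to any sense of w
                    score += max(sense.wup_similarity(s) or 0.0 for s in w_senses)
            if score > best_score:
                best, best_score = cand, score
    return best

# In an Informatics context, "program" should win over "plan"/"schedule":
print(best_translation(["plan", "schedule", "program"],
                       ["computer", "software", "algorithm"]))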


3.4.3 Semantic Characterization
In order to assist the remainder of the personalization process (C-Logs creation,
semantic document clustering, semantic recommendations), the n most frequent
(translated) keywords that were extracted in the previous phase are mapped to the terms
O = {c1, ..., ck} of a domain ontology (in our approach we need the concept hierarchy
part of the ontology). This mapping is performed using a thesaurus [4]. If the keyword
belongs to the ontology, then it is included as is. Otherwise, the system finds the closest
(i.e. most similar) term (category) to the keyword through the mechanisms provided by
the thesaurus. Since the keywords carry weights according to their frequency, the
categories' weights are updated accordingly.

[4] In our system implementation, we use WordNet [WN] for this purpose.
We should stress here that the selection of the ontology influences the outcome of the
mapping process. For this reason, it should be semantically relevant to the content to be
processed. In order to find the closest term in the ontology O for a keyword k that
describes a document, we compute the Wu & Palmer similarity [WP94] between all
senses Sn(k) of k and all senses Sn(ci) of every category ci in O. At the end of this
process, each keyword is mapped to every category with a respective similarity s. We
select the (k, c) pair that gives the maximum similarity s. This process is shown in
Figure 4.
Procedure CategoryMapping(k, O)
1.  for all sns ∈ Sn(k) do
2.    for all ci ∈ O do
3.      scsimmax ← maxarg sc∈Sn(ci) (WPsim(sns, sc));
4.    done
5.    ssimmax = max({scsimmax});
6.    cmax = c ∈ O, for which (scsimmax == ssimmax);
7.  done
8.  sim = max({ssimmax});
9.  cat = c ∈ {cmax}, for which (ssimmax == sim);
10. return(cat, sim);
11. done

Figure 4. The semantic characterization process


If more than one keyword is mapped to the same category ci, the relevance ri
assigned to it is computed using the following formula:

$$r_i = \frac{\sum_{k_j \to c_i} (w_j \cdot s_j)}{\sum_{k_j \to c_i} w_j} \quad (3)$$
where wj is the weight assigned to keyword kj for document d and sj is the similarity with
which kj is mapped to ci. At the end of this process, each document d is represented as a
set d = {(ci, ri)}, where ri ∈ [0,1] since sj ∈ [0,1]. Even though the aforementioned process
has complexity O(|c|n²), where n is the number of senses of a word, it does not aggravate
the system's performance, since it is performed offline, once, and is repeated only when
the content of a web page changes or when new web pages are added to the web site.
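For illustration, Equation 3 amounts to a weighted average of the mapping similarities;
a minimal sketch follows, where the input layout is our own assumption.

# Sketch of Equation (3): relevance of category c_i given the keywords
# mapped to it, as a list of (weight w_j, similarity s_j) pairs.
def category_relevance(mapped):
    return sum(w * s for w, s in mapped) / sum(w for w, _ in mapped)

print(category_relevance([(3.0, 0.9), (1.0, 0.6)]))   # (2.7 + 0.6) / 4 = 0.825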


3.5 C-Logs Creation & Mining


C-Logs are in essence the web site's logs enhanced with content semantics. The
C-Logs creation process involves the association of each web log record with the ontology
terms that represent the respective URI. C-Logs are processed using the same statistical
and data mining techniques applied to web logs, such as association rules, clustering or
sequential pattern discovery.
The web mining algorithms currently supported by SEWeP are frequent itemset and
association rule discovery. Both are based on a variation of the Apriori algorithm
[AS94], used to extract patterns that represent the visitors' navigational behavior in terms
of pages often visited together. The input to the algorithm is the set of recorded user
sessions, expressed both at the URI and at the category level. The output is a set of
URI-based and category-based frequent itemsets or association rules respectively. Since no
explicit user/session identification data are available, we assume that a session is defined
by all the pageview visits made by the same IP, having less than a maximum threshold
time gap between consecutive hits.
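A minimal sketch of this sessionization heuristic; the data layout and names are
illustrative assumptions.

from collections import defaultdict

def sessionize(hits, max_gap=20 * 60):
    """hits: iterable of (ip, timestamp_in_seconds, uri), time-ordered per IP."""
    by_ip = defaultdict(list)
    for ip, ts, uri in hits:
        by_ip[ip].append((ts, uri))
    sessions = []
    for visits in by_ip.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, uri) in zip(visits, visits[1:]):
            if ts - prev_ts > max_gap:       # gap too large: start a new session
                sessions.append(current)
                current = []
            current.append(uri)
        sessions.append(current)
    return sessions

print(sessionize([("1.2.3.4", 0, "/a"), ("1.2.3.4", 60, "/b"),
                  ("1.2.3.4", 5000, "/c")]))   # -> [['/a', '/b'], ['/c']]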

3.6 Document Clustering


After the content characterization process, all web documents are semantically
annotated with terms belonging to a concept hierarchy. This knowledge is materialized
by grouping together documents that are characterized by semantically similar terms, i.e.
neighboring categories in the hierarchy. This categorization is achieved by clustering the
web documents based on the similarity among the ontology terms that characterize each
one of them. The generated clusters capture semantic relationships that may not be
obvious at first sight, for example between documents that are not structurally close (i.e.
not under the same path of the web site's hierarchy).
The problem of document clustering is considerably different from that of clustering
points in a metric space. In SEWeP, the objects to be clustered are sets of (weighted)
terms of a concept hierarchy, corresponding to the categories extracted for each
document. In this space, there are no coordinates and no ordering as in a Euclidean
metric space. We can only compute the similarity between documents given an
appropriate similarity measure between sets of weighted categories; we therefore use
THEsim, as defined earlier, together with a modification of the density-based algorithm
DBSCAN [EK+98] for clustering the documents. After the document clustering, each
cluster is labeled by the most descriptive categories of the documents it contains, i.e. the
categories that characterize more than t% of the documents. The modification details and
the algorithm itself are described in [HN+03, VV+04]. The semantic document clusters
are in turn used to expand the recommendation set with semantically similar web pages,
as we describe in Section 3.7.
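As an illustration of this step, the following sketch uses scikit-learn's DBSCAN over a
precomputed semantic distance matrix (1 − THEsim), as a stand-in for the modified
DBSCAN of [HN+03]; the_sim refers to the Section 3.3 sketch, and the eps/min_samples
values are arbitrary.

import numpy as np
from sklearn.cluster import DBSCAN

docs = [[(1.0, "ski")], [(0.8, "skating")], [(1.0, "hotel")], [(0.9, "resort")]]
n = len(docs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        dist[i, j] = 1.0 - the_sim(docs[i], docs[j])   # semantic distance

labels = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)    # documents with the same label fall in the same semantic cluster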

3.7 Recommendation Engine


As already mentioned, after the document characterization and clustering processes
have been completed, each document d is represented by a set of weighted terms
(categories) that are part of the concept hierarchy: d = {(ci, ri)}, ci ∈ O, ri ∈ [0,1] (O is
the concept hierarchy, ri is ci's weight). In the motivating example, we presented how this
knowledge can be transformed into three different types of recommendations, depending
on the rules that are used as input (association rules between URIs or between categories)
and the involvement of semantic document clusters: original, semantic, and
category-based recommendations.
Original recommendations are the straightforward way of generating
recommendations, relying solely on the usage data of a web site. They are generated
when, for each incoming user, a sliding window of her past n visits is matched to the
URI-based association rules in the database, and the m most similar ones are selected.
The system recommends the URIs included in the rules but not visited by the user so far.
The intuition behind semantic recommendations is that useful knowledge,
semantically similar to what is originally proposed to the users, may be omitted for
several reasons (updated content, not enough usage data, etc.). These recommendations
are in the same format as the original ones, but the web personalization process is
enhanced by taking into account the semantic proximity of the content. In this way, the
system's suggestions are enriched with content bearing similar semantics. In short, they
are generated when, for each incoming user, a sliding window of her past n visits is
matched to the URI-based association rules in the database, and the single most similar
one is selected. The system finds the URIs included in the rule but not yet visited by the
user (let A) and recommends the m most similar documents that are in the same semantic
cluster as A.
Finally, the intuition behind category-based recommendations is the same as that
of semantic recommendations: incorporate content and usage data in the recommendation
process. This notion, however, is further expanded by expressing the user's navigational
behavior in a more abstract, yet semantically meaningful way. Both the navigational
patterns knowledge database and the current user's profile are expressed in terms of
categories. Therefore, the matching of patterns to the current user's navigational behavior
is no longer exact, since it utilizes the semantic relationships between the categories, as
expressed by their topology in the domain-specific concept hierarchy. The final set of
recommendations is generated when, for each incoming user, a sliding window of the
user's past n visits is matched to the category-based association rules in the database, and
the most similar one is selected. The system finds the most relevant document cluster
(using the similarity between category terms) and recommends the documents that have
not yet been visited by the user.
In what follows, we describe in detail the semantic and category-based
recommendation algorithms. The description of the generation of original
recommendations is omitted, since it is a straightforward application of the Apriori
algorithm to the sessionized web logs.
3.7.1 Semantic Recommendations
Navigational patterns. We use the Apriori algorithm [AS94] to discover frequent
itemsets and/or association rules from the C-Logs, CLg. We consider that each distinct
user session represents a different transaction. We will use S = {Im} to denote the final
set of frequent itemsets/association rules, where Im = {(urii)}, urii ∈ CLg.
Recommendations. In brief, the recommendation method takes as input the user's
current visit, expressed as a set of URIs: CV = {(urij)}, urij ∈ WS (WS is the set of the
web site's URIs; note that some of these may not be included in CLg). The method finds
the itemset in S that is most similar to CV, and recommends the documents (labeled by
related categories) belonging to the most similar document cluster Clm ∈ Cl (Cl is the set
of document clusters). In order to find the similarity between URIs, we perform binary
matching (denoted as SIM). This procedure is shown in Figure 5.
Procedure SemanticRec(CV)
1.  CM ← ∅;
2.  Im = maxarg I∈S SIM(I, CV);
3.  for all d ∈ Im do
4.    for all cj ∈ d do
5.      if cj ∈ CM then
6.        rj += rj;
7.        CM ← (cj, rj);
8.      else
9.        CM ← (cj, rj);
10.   done
11. done
12. return D = {d}, {d} ∈ Clm, maxarg Clm∈Cl WPsim(Clm, CM);

Figure 5. The semantic recommendation method

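The binary matching step of line 2 can be sketched as follows; the Jaccard-style
overlap score is our own assumption of a reasonable instantiation of SIM.

# Sketch of the SIM step: pick the mined URI itemset that best overlaps
# the current visit (sets of URIs); a Jaccard score is assumed for SIM.
def sim(itemset, current_visit):
    I, CV = set(itemset), set(current_visit)
    return len(I & CV) / len(I | CV)

def most_similar_itemset(itemsets, current_visit):
    return max(itemsets, key=lambda I: sim(I, current_visit))

print(most_similar_itemset([{"/a", "/b"}, {"/a", "/c", "/d"}], {"/a", "/b"}))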

3.7.2 Category-based Recommendations
Navigational patterns. We use an adaptation of the Apriori algorithm [AS94] to
discover frequent itemsets and/or association rules involving categories. We consider that
each distinct user session represents a different transaction. Instead of using the distinct
URIs visited as input, we replace them with the respective categories. We keep the most
important ones, based on their frequency (since the same category may characterize more
than one document). We then apply the Apriori algorithm using categories as items. We
will use C = {Ik} to denote the final set of frequent itemsets/association rules, where
Ik = {(ci, ri)}, ci ∈ O, ri ∈ [0,1] (ri reflects the frequency of ci).
Recommendations. In short, the recommendation method takes as input the user's
current visit, expressed in weighted category terms: CV = {(cj, fj)}, cj ∈ O, fj ∈ [0,1] (fj is
the normalized frequency of cj in the current visit). The method finds the itemset in C that
is most similar to CV, creates a generalization of it, and recommends the documents
(labeled by related categories) belonging to the most similar document cluster Cln ∈ Cl
(Cl is the set of document clusters). To find the similarity between categories we use the
Wu & Palmer metric, whereas to find the similarity between sets of categories we use the
THESUS metric (denoted as WPsim and THEsim respectively), defined in Section 3.3.
This procedure is shown in Figure 6.
Procedure CategoryRec(CV)
1. Ik = maxarg I∈C THEsim(I, CV);
2. for all cj ∈ CV do
3.   ci = maxarg c∈Ik WPsim(c, cj);
4.   cn = least_common_ancestor(ci, cj), rn = max(ri, rj);
5.   CI ← (cn, rn);
6. done
7. return D = {d}, {d} ∈ Cln, maxarg Cln∈Cl WPsim(Cln, CI);

Figure 6. The category-based recommendation method
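The generalization step of line 4 reduces to a least-common-ancestor computation in
the concept hierarchy; a sketch, reusing ancestors() from the Section 3.3 example.

# Least common ancestor of two categories (line 4 of Figure 6); ancestors()
# is the helper from the Section 3.3 sketch, walking child -> parent links.
def least_common_ancestor(a, b):
    anc_a = set(ancestors(a))
    return next(n for n in ancestors(b) if n in anc_a)

print(least_common_ancestor("ski", "skating"))   # -> "sport"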


The same procedure can be run by omitting the weights in one or all of the phases of
the algorithm. On the other hand, in case weights are used, an extension of the Apriori
algorithm that incorporates weights in the association rule mining process, such as
[SA95], can be used. Let us also stress that even though this description of the method
focuses on the itemset representation (derive frequent itemsets and use them in the
recommendation method), it can also be applied (with no further modification) to the
association rules that can be derived from those itemsets. If association rules are derived,
then the user's activity is matched to the LHS of the rule (step 2), and recommendations
are generated using the RHS of the rule (step 7).

3.8 Experimental Evaluation


So far, we have described the framework for enhancing the recommendation process
through content semantics. Our claim, that semantically annotating the web content with
terms derived from a domain-specific taxonomy prior to the recommendation process
enhances the results of web personalization, is intuitive. Since the objective of the system
is to provide useful recommendations to the end users, we performed an experimental
study, based on blind testing with 15 real users, in order to validate the effectiveness of
our approach. The results indicate that the effectiveness of each recommendation set
(namely, Original, Semantic, Category) depends on the context of the visit and the user's
interests. What is evident, however, is that a hybrid model, incorporating all three types
of recommendations, generates the most effective results.
3.8.1 Methodology
Data Set. We used the web logs of the DB-NET web site [DBN]. This is the site of a
research team, hosting various academic pages, such as course information, research
publications, as well as members' home pages. The two key advantages of using this data
set are that the web site contains web pages in several formats (such as pdf, html, ppt,
doc, etc.), written both in Greek and English, and that a domain-specific concept
hierarchy is available (the web administrator created a concept hierarchy of 150
categories that describe the site's content). On the other hand, its context is rather narrow,
as opposed to web portals, and its visitors are divided into two main groups: students and
researchers. Therefore, the subsequent analysis (e.g. association rules) uncovers these
trends: visits to course material, or visits to publications and researcher details. It is
essential to point out that the need for processing online (up-to-date) content made it
impossible for us to use other publicly available web log sets, since all of them were
collected many years ago and the content of the relevant sites is no longer available.
Moreover, the web logs of popular web sites or portals, which would be ideal for our
experiments, are considered personal data and are not disclosed by their owners. To
overcome these problems, we collected web logs over a 1-year period (1/11/02 -
31/10/03). After preprocessing, the total web log size was approximately 10^5 hits,
comprising a set of over 67,500 distinct anonymous user sessions on a total of 357 web
pages. The sessionizing was performed using distinct IP & time limit considerations
(setting 20 minutes as the maximum time between consecutive hits from the same user).
Keyword Extraction - Category Mapping. We extracted up to 7 keywords from each
web page using a combination of all three methods (raw term frequency, inlinks,
outlinks). We then mapped these keywords to ontology categories and kept at most 5 for
each page.
Document Clustering. We used the clustering scheme described in [HN+03], i.e. the
DBSCAN clustering algorithm and the THEsim similarity measure for sets of keywords.
However, other web document clustering schemes (algorithm & similarity measure) may
be employed as well.
Association Rules Mining. We created both URI-based and category-based frequent
itemsets and association rules. We subsequently used those exceeding a 40% confidence
threshold.
3.8.2 Experimental Results
In our experiments, we chose three popular paths followed by users in the past, each
one having a different objective: one (A) containing visits to contextually irrelevant
pages ("random surfer"), a second (B) comprising a short path to very specialized pages
("information-seeking visitor"), and a third one (C) including visits to top-level, yet
research-oriented pages ("topic-oriented visitor"). We then conducted a series of 4
experiments. These paths, along with the recommendations generated for Experiment #2,
are included in Appendix A.
Experiment #1. For the first experiment, we created three different sets of
recommendations named Original, Semantic, and Category (the sets are named after the
respective recommendation methods). We presented the users with the paths and the
three sets (unlabeled) in random order and asked them to rate each set as indifferent,
useful or very useful. The outcome is shown in Figure 7.
The results of the first experiment revealed that, depending on the context and
purpose of the visit, users profit from different sources of recommendations. More
specifically, in visit A, both the Semantic and Category sets are mostly evaluated as
useful/very useful. The Category recommendation set performs better, which can be
explained by the fact that it is the one that recommends 3 hub pages; these seem to be
the best suggestion after a random walk on the site. On the other hand, in visits B and C,
Semantic performs better. In visit B, the path was focused on specific pages and the same
held for the recommendation preferences. In visit C, the recommendations that were
more relevant to the topics previously visited were preferred.


[Figure 7: three bar charts, one per path (A, B, C), showing the percentage of users
rating each recommendation set (Original, Semantic, Category) as Very Useful, Useful
or Indifferent.]

Figure 7. Experiment #1: Recommendation sets evaluation


For that reason, we decided to evaluate the performance of a hybrid method that
incorporates all three types of recommendations. We ran a set of experiments comparing,
for each path, each one of the proposed recommendation sets with a Hybrid
recommendation set, containing the top recommended URIs from each of the three
methods (Original, Semantic, Category). We then asked the users to choose the
recommendation set they preferred.
Experiment #2: In the second experiment, we asked the users to choose between the
Hybrid and the Original recommendation sets. The outcome is shown in Figure 8.
[Figure 8: bar chart of user preferences (%) between the Original and Hybrid
recommendation sets, for Paths A, B and C.]

Figure 8. Experiment #2: Original vs. Hybrid Recommendations


The results of this experiment verify our intuition that users benefit from the
semantic enhancement of the recommendations. Again, this depends on each visit's
purpose, but overall the users rate the Hybrid SEWeP outcome as equal to or better than
the pure usage-based one.
Experiment #3: In the third experiment, we presented the users with the Semantic and
the Hybrid recommendation sets and asked them to rank them in terms of their usefulness
in navigation. The outcome is shown in Figure 9.
We observe that the Hybrid recommendation set seems to perform better in the case of
the random path (A), whereas the Semantic recommendation set prevails in the case of
the more specialized paths B and C. Hence, we cannot conclude which recommendation
set is better, since this depends on the path and therefore on the user's specific interests,
even though the Semantic recommendation set seems to prevail in the case of specialized
information-seeking visits.
[Figure 9: bar chart of user preferences (%) between the Semantic and Hybrid
recommendation sets, for Paths A, B and C.]

Figure 9. Experiment #3: Semantic vs. Hybrid Recommendations


Experiment #4: In the fourth experiment, users had to select between the
Category-based recommendation set and the Hybrid one. The outcome is shown in Figure 10.
The results of this experiment demonstrate the dominance of the Hybrid recommendation
set over the Category-based one. One explanation is that, in the latter case, important
information may be lost during the generalization (converting the user's current path
to categories) and subsequent specialization (converting categories back to URIs) process.
[Figure 10: bar chart of user preferences (%) between the Category-based and Hybrid
recommendation sets, for Paths A, B and C.]

Figure 10. Experiment #4: Category-based vs. Hybrid Recommendations


Based on these experimental results, we observe that what is characterized as useful
by the users depends on the objective of each visit. Out of the three possible
recommendation sets, the Semantic recommendation set, generated after the semantic
expansion of the most popular association rule, performs best. Comparing all three
recommendation sets with the Hybrid one, we observe that the latter dominates, since the
hybrid recommendations are preferred by the users in most cases (7 out of 9). Therefore,
we conclude that SEWeP's semantic enhancement of the personalization process
improves the quality of the recommendations in terms of complying with the users'
needs.
A general observation, made after examining each user's preferences individually, is
that the usefulness of recommendations is a very subjective issue. Depending on the
circumstances, each user's preferences differ from those of the other users, and
sometimes even from his or her own. Thus, an objective qualitative evaluation of such
systems is quite difficult.

3.9 System Prototypes


In this Section we present the SEWeP prototype [PL+04], emphasizing some
implementation details. We also briefly present two prototypes partly based on some of
the proposed framework's components, namely, IKUM and the Greek Web Archive.
More details on the personalized content delivery platform IKUM [VEA03, EVA05], as
well as on the Greek Web Archiving framework [LE+04], can be found in the respective
publications.
3.9.1 SEWeP Prototype
The SEWeP system prototype is entirely based on Java (JDK 1.4 or later). For the
implementation of SEWeP we utilized the following third-party tools & algorithms: PDF
Box Java Library, Jacob Java-COM Bridge Library, and the swf2html library (for text
extraction); Xerces XML Parser; WordNet v1.7.1 ontology; the JWNL and JWordnet 1.1
Java interfaces for interaction with WordNet; the Porter Stemming Algorithm [P80] for
English; Triantafillidis Greek Grammar [Tria]; Apache Tomcat 4.1 and Java Servlets for
the recommendation engine; and the JDBC Library for MS SQL Server. The main
functionalities of the prototype are described below:
Logs Preprocessing: The system provides full functionality for preprocessing any
kind of web logs, by enabling the definition of new log file templates, filters
(including/excluding records based on field characteristics), etc. The clean logs are
stored in new files. A screenshot of the log preprocessing module is shown in Figure 11.
Content Retrieval: The system crawls the web and downloads the web site's pages,
extracting the plain text from a variety of crawled file formats (html, doc, php, ppt, pdf,
flash, etc.) and storing it in appropriate database tables.
Keyword Extraction & Translation: The user selects among different methods for
extracting keywords. Prior to the final keyword selection, all non-English keywords are
translated using an automated process (the system currently also supports Greek content).
All extracted keywords are stored in a database table along with their relative frequency.
Keyword - Category Mapping: The extracted keywords are mapped to the categories of
a domain-specific ontology. The system finds the closest category to each keyword
through the mechanisms provided by a thesaurus (WordNet [WN]). The weighted
categories are stored in XML files and/or in a database table.
Session Management: SEWeP enables anonymous sessionizing based on distinct IPs
and a user-defined time limit between sessions. The distinct sessions are stored in XML
files and/or database tables. Figure 12 includes a screenshot of this module.
Semantic Association Rules Mining: SEWeP provides a version of the Apriori
algorithm [AS94] for extracting frequent itemsets and/or association rules (confidence
and support thresholds set by the user). Apart from URI-based rules, the system also
provides functionality for generating category-based rules. The results are stored in text
files for further analysis or use by the recommendation engine. Figure 13 includes a
screenshot of this module.
Clustering: SEWeP integrates clustering facilities for organizing the documents into
meaningful semantic clusters. Currently, SEWeP capitalizes on the clustering tools
available in the THESUS system [VV+04].


Recommendations: The (semantic) association rules/frequent itemsets created feed a
server-side application (servlet) in order to dynamically generate recommendations for
the visitors of the personalized site.

Figure 11. SEWeP screenshot: The Logs Preprocessing module


Figure 12. SEWeP screenshot: the Session Management module

Figure 13. SEWeP screenshot: the Semantic Association Rules Mining module


3.9.2 The I-KnowUMine Project


The objective of the I-KnowUMine (IKUM) project is the development of a content
delivery platform based on content, knowledge and behavioral data in order to present
personalized content to the users. This approach benefits from the combination and
integration of technology advances in areas such as web mining, content management,
personalization and portals. The IKUM system modules are classified into four main
layers: the Content Management Layer, Web Mining Layer, Knowledge Management
Layer, and Interaction Layer, as illustrated in the block diagram of Figure 14.

Figure 14. The IKUM system architecture


The Content Management Layer is based on the content characterization component
of the SEWeP architecture. It incorporates the Content Management Module, the
Taxonomy Management Module and the Content Classification Module. The main
functionalities implemented in this layer are the support for consistent authoring and
storage of the web site's content, its enrichment with semantic information (generated
automatically, or corrected/provided by a domain expert), the support for creating or
importing taxonomies, as well as the support for administrative functions such as
workflow and user management.
The Web Mining Layer is based on the C-logs creation & mining components of the
SEWeP architecture, enabling the semantic enhancement of web logs in order to create
the C-logs, which are in turn used as input to the Web Mining Module.
The Knowledge Management Layer is responsible for managing the knowledge
generated by the Web Mining layer and includes its deployment through various
recommendation engines (Recommendation Module).
Apart from these three general layers, there is also an Interaction Layer, which
includes the Publishing Module and the web server; the latter presents the corresponding
personalized page to every user, by combining the possibly "fixed" parts of the web page
with the parts where the personalized information should be presented. More details on
the IKUM project can be found in [VEA03, EVA05].
3.9.3 The Greek Web Archiving Project
The objective of this project is to propose a framework for archiving the Greek Web.
This process involves the creation of an archive containing as many Greek web pages
as possible, as well as the extraction of knowledge from this collection. What should be
characterized as "Greek Web" is not clear-cut, since there exist many Greek web sites that
are not under the .gr top-level domain. Therefore, the main criteria we use in order to
define the Greek perimeter, apart from the domain name, are the Greek language and
Hellenic-oriented content. In addition to collecting the data, though, we also perform a
semantic characterization of the pages in order to group them into thematic clusters.
These clusters can subsequently be used to accelerate search in the Web Archive and to
enable keyword-based search without human intervention.
The Greek Web Archiving system architecture is depicted in Figure 15. The system
consists of three main components: the Web Crawler, the Content Manager and the
Clustering Module. The Web Crawler searches the web using the aforementioned criteria
in order to gather as many Greek web pages as possible. The collected URIs are stored
in a database along with the date and time the crawling was performed, to enable future
updating of the archive. Some additional information, such as the web pages that point
to, or are pointed to by, each URI, can also be included for future use. The Content
Manager is in essence the content characterization component of the SEWeP
architecture. Finally, the Clustering Module uses the K-means or the DBSCAN algorithm
and generates a label for each created cluster, taking the cluster centroids into account.
The system also integrates a Cluster Validation sub-module, in order to evaluate the
quality of the created clusters. Since the structure of the data was not known a priori, we
used relative cluster validation criteria [HBV02], including the Dunn index, modified
Hubert statistics and the Davies-Bouldin index for this purpose. More information on this
project can be found in [LE+04].

Figure 15. The Greek Web Archiving system architecture


3.10 Conclusions
In this Chapter we presented the key concepts and algorithms underlying SEWeP, a
novel semantic web personalization system. SEWeP is based on the integration of content
semantics with the users' navigational behavior in order to generate recommendations.
The web site's documents are automatically mapped to ontology terms, enabling further
processing (clustering, association rules mining, recommendations generation) to be
performed based on the semantic similarity between these terms. Using this
representation, the final recommendation set presented to the user is semantically
enhanced, overcoming problems that emerge when pure usage-based personalization is
performed. Experimental results with real users have verified our claim that the semantic
enrichment of the personalization process improves the quality of the recommendations
in terms of complying with the users' needs. Nevertheless, the usefulness of
recommendations is a very subjective issue and therefore very difficult to evaluate. A
general observation that can be made is that, out of the three recommendation sets the
system generates, the Semantic recommendation set, generated after the semantic
expansion of the most popular association rule, performs best; yet, comparing all three
recommendation sets with the Hybrid one, we conclude that the Hybrid setup is the most
useful of all.

CHAPTER 4
Link Analysis for Web Personalization
The connectivity features of the web graph play an important role in the process of
searching and navigating the web. Several link analysis techniques, based on the popular
PageRank algorithm [BP98], have been widely used in the context of web search
engines. The underlying intuition of these techniques is that the importance of each page
in a web graph is defined by the number and the importance of the pages linking to it.
In this thesis, we introduce link analysis in a new context, that of web personalization.
Motivated by the fact that, in the context of navigating a web site, a page/path is
important if many users have visited it before, we propose a new algorithm, UPR
(Usage-based PageRank). UPR is based on a personalized version of PageRank, favoring
pages and paths previously visited by many web site users. We apply UPR to a
representation of the web site's user sessions, termed the Navigational Graph, in order to
rank the web site's pages. This ranking may then be used in several contexts:
• Use it as a global ranking of the web site's pages. The computed rank
probabilities can serve as the prior probabilities of the pages when
recommendations are generated using probabilistic predictive models such as
Markov Chains, higher-order Markov models, tree synopses, etc.
• Apply UPR to small subsets of the web site's navigational graph (or its
approximations), which are generated based on each current user's visit. This
localized version of UPR (named l-UPR) provides localized personalized
rankings of the pages most likely to be visited by each individual user.
In what follows, we illustrate our approach through a motivating example. We then
provide the required theoretical background on link analysis before presenting the
proposed algorithm. We prove that this hybrid algorithm can be applied to any web site's
navigational graph, as long as the graph satisfies certain properties. We then proceed with
describing the two proposed frameworks in which UPR can be applied, namely, the
localized personalized recommendations with l-UPR, and the hybrid probabilistic
predictive models (h-PPM). We conclude with an extensive experimental evaluation
performed on both frameworks (l-UPR and h-PPM), proving our claim that the
underlying link structure of web sites should be taken into consideration in the web
personalization process, and with details on the system prototype we used. Parts of this
chapter have appeared in [EV05, EVK05].

4.1 Motivating Example


Assume that the graph of Figure 16 represents a fraction of the web. The nodes
represent web pages and the edges represent the links between them. Suppose that a
user U has already visited pages a and b. We want to predict which is the most probable
path U will follow next, namely, c→d or c→e. We notice from this small fraction of the
web graph that d is linked to by more pages than e. Based on the assumption that a web
page is important if it is pointed to by many other (important) pages, it seems that d is
more important, and therefore the most probable path is the one ending on this page.
This simple example shows in brief how PageRank works. PageRank is an iterative
algorithm which assigns visit probabilities (authority scores) to the pages of a web graph.
In each iteration, the authority score of a page is distributed evenly to the pages it points
to. The more authority a page has, the more authority its out-linked pages will receive.
Conversely, the more incoming links a page has, the more authority it accumulates. By
applying PageRank to the entire web graph, we get an importance ranking of all the
pages it includes. This ranking has been very useful in the context of web search.
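To make the mechanics concrete, a minimal power-iteration sketch of PageRank follows;
the graph encoding, damping factor and iteration count are illustrative assumptions, and
refinements such as dangling-node handling are omitted.

def pagerank(graph, d=0.85, iters=50):
    """graph: {page: [pages it links to]}; returns authority scores."""
    pages = list(graph)
    pr = dict.fromkeys(pages, 1.0 / len(pages))
    for _ in range(iters):
        nxt = dict.fromkeys(pages, (1 - d) / len(pages))
        for p, outs in graph.items():
            for q in outs:                    # authority split evenly over out-links
                nxt[q] += d * pr[p] / len(outs)
        pr = nxt
    return pr

# toy graph in the spirit of Figure 16: d has more in-links than e
print(pagerank({"a": ["c"], "b": ["c"], "c": ["d", "e"], "d": [], "e": ["d"]}))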


Figure 16. PageRank-based example


Assume now that the graph of Figure 17 represents a fraction of a web site's graph.
The weights on the edges represent the number of one-step user transitions between the
connected nodes (pages). Based on these weights, we observe that page e has been visited
by more people than page d; therefore, we may claim that in this navigational graph
c→e seems to be more "important" than c→d in terms of the users' interest. We may
therefore extend the PageRank assumption and claim that, in the web navigation context,
a page is considered important if many users have visited it before.

Figure 17. Usage-based PageRank (UPR) example


This is the intuition behind the hybrid algorithm we propose in this thesis, UPR. UPR
is in essence a usage-based variation of PageRank which is applied to the web site's
navigational graph, and provides us with authority scores that represent the importance of
the web site's pages both in terms of link connectivity and in terms of visit frequency.
In what follows, we present some preliminaries concerning the navigational graph in
Section 4.2. Section 4.3 includes a detailed analysis of both the PageRank and UPR
algorithms. The two proposed personalization frameworks in which UPR can be applied
are presented in Sections 4.4 and 4.5. The experimental study is detailed in Section 4.6,
whereas an overview of the prototype system implementing the proposed frameworks is
included in Section 4.7.

4.2 Preliminaries
The input to our proposed algorithm is the Navigational Graph (NG). NG is a
weighted directed graph representation of the user sessions. NG can be used in order to
discover page and path probabilities and support popular path prediction, since it contains
all the distinct user sessions and is therefore a full representation of the actual user paths
followed in the past. This structure, however, can become large, especially when
modeling the user sessions of big web sites. Therefore, the processing of NG may become
computationally very intensive. The need for reduced complexity and online availability
imposes the creation of approximations of the NG, referred to as NG synopses. An NG
synopsis may be a Markov model of any order (depending on the simplicity/accuracy
trade-off that is required), or any other graph synopsis, such as those proposed in [PG02,
PGI04]. We should stress at this point that our approach is orthogonal to the type of
synopsis one may choose. In what follows we present in more detail the NG structure and
its synopses, emphasizing Markov models, since these are the NG synopses we use in the
second framework proposed in this thesis, as well as in the experimental study we
performed.
4.2.1 The Navigational Graph
As already mentioned, the Navigational Graph (NG) is a weighted directed graph
which represents the user sessions of a web site. In its simplest form, NG is a node- and
edge-labeled tree that has as root a special node R, where the labels of the nodes identify
the M web pages of the web site WS. Another option would be to encode the data as a
graph using a bisimulation of the tree-based representation. We stress that this choice is
orthogonal to the techniques that we introduce. The edges of NG represent the links
between the web pages (i.e. the paths followed by the users), and the labels (weights) on
the edges represent the number of link traversals. The weighted paths from the root
towards the leaves represent all the user session paths that are included in the web logs.
All tree paths terminate in a special leaf-node E denoting the end of a path. The NG
resembles the web site's graph; it may, however, include page links that do not physically
exist (if, for example, a user jumps from one page to another following a bookmark), or,
on the other hand, it may not include existing hyperlinks, if they were never followed in
the past. Since NG is a complete representation of the information residing in the web
logs, there is a high degree of replication of states in different parts of this structure.
The NG creation algorithm is as follows: for every user session US in the web logs,
we create a path starting from the root of the tree. If a subsequence of the session already
exists, we update the weights of the respective edges; otherwise we create a new branch,
starting from the last visited common page in the path. We note that any consecutive
page repetitions have been removed from the user sessions during the data cleaning
process; on the other hand, we keep any pages that have been visited more than once, but
not consecutively. We also denote the end of a session using a special exit node. The
algorithm for creating the NG is detailed in Figure 18.


Procedure CreateTree(U)
Input: User Sessions U
Output: Navigational Tree NG
1.  root <- NG;
2.  tmpP <- root;
3.  for every US ∈ U do
4.    while US ≠ ∅ do
5.      si = first_state(US);
6.      if parent(tmpP, si) then
7.        wtmpP,i = wtmpP,i + 1;
8.        tmpP <- si;
9.        US <- remove(US, si);
10.     else
11.       addchild(tmpP, si);
12.       wtmpP,i = 1;
13.       tmpP <- si;
14.       US <- remove(US, si);
15.     endif
16.   done
17.   if parent(tmpP, E) then
18.     wtmpP,E = wtmpP,E + 1;
19.   else
20.     addchild(tmpP, E);
21.     wtmpP,E = 1;
22.   endif
23.   tmpP <- root;
24. done

Figure 18. NG Creation Algorithm
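For illustration, the same construction can be sketched as a weighted trie in a few lines
of Python; the dictionary layout and function name are our own simplification, not the
thesis prototype's data structure.

# Weighted-trie sketch of the NG construction; "E" is the special exit node.
def build_ng(sessions):
    root = {}                        # node: {page: [edge_weight, child_node]}
    for session in sessions:
        node = root
        for page in list(session) + ["E"]:
            entry = node.setdefault(page, [0, {}])
            entry[0] += 1            # update the traversal count of this edge
            node = entry[1]
    return root

ng = build_ng(["abcd", "abed", "acdf", "bcbg", "bcfa"])   # the sessions of Table 3 below
print(ng["a"][0])    # 3 sessions start with page a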


In order to make this process clearer, we present a simple example. Assume that the
user sessions of a web site are those included in Table 3. The Navigational Graph created
after applying the aforementioned algorithm is depicted in Figure 19.
Table 3. User Sessions

User Session #    Path
1                 a→b→c→d
2                 a→b→e→d
3                 a→c→d→f
4                 b→c→b→g
5                 b→c→f→a


Figure 19. Navigational Graph


4.2.2 Markov Models
As already stated, NG can become large, as it contains redundant information (such as
recurring sub-paths). As a consequence, performing computations directly over the NG
can become prohibitively expensive. The need for reduced complexity and online
availability imposes the creation of NG synopses, reducing the NG structure size.
These synopses capture the sequential dependence between visits up to some level, while
preserving their most important statistical characteristics. The more detailed the synopsis
is, the more accurate the representation of NG will be. On the other hand, the construction
of a less detailed synopsis will save time and computational power. In this thesis we
elaborate on Markov models, since these are the synopses used in our proposed
frameworks and experimental study.
Every node in NG may be considered as a state in a discrete Markov model, which may
be defined as a tuple <S, TP, L>, where S is the state space, which includes all nodes in
the transition graph, TP is the one-step transition probability matrix, and L is the initial
probability distribution over the states in S. In this model, the navigation of a user
may be represented as a stochastic process $\{X_n\}$ that has S as its state space. If
$P_{i,j}^{(m)}$ is the conditional probability of visiting page j in the next step, based on
the last m pages visited, then $\{X_n\}$ is called an mth-order Markov model [Kij97].
The simplest synopsis of NG is a Markov Chain. The Markov Chain is built upon the
Markov Property, which states that each next visit to a page depends only on the
current one and is independent of the previous ones. Therefore, a Markov Chain is a
1st-order Markov model, and the conditional probability of visiting page $x_j$ in the next
step is given by Equation 4:

$$P_{i,j}^{(1)} = P\left(X_{n+1}=x_j \mid X_n=x_i, X_{n-1}=x_{i_{n-1}}, \ldots, X_0=x_{i_0}\right) = P\left(X_{n+1}=x_j \mid X_n=x_i\right) \quad (4)$$

This representation is simple to construct; it depends, however, on the assumption that the navigation is memoryless, in other words, that the next page to be visited by a user depends only on the user's current location. NG synopses that take into consideration the long-term memory aspects of web surfing are higher-order Markov models, which can easily be constructed from NG by computing the k-step path frequencies (where k is the order of the model). In essence, higher-order Markov models relax the Markov property. Therefore, given that the user is currently at page x_i and has already visited pages x_{i_{n-1}}, ..., x_{i_0}, then $P_{i,j}^{(m)}$ is based only on pages x_i, x_{i_{n-1}}, ..., x_{i_{n-m+1}} and is given by Equation 5:

$P_{i,j}^{(m)} = P(X_{n+1} = x_j \mid X_n = x_i, X_{n-1} = x_{i_{n-1}}, \ldots, X_0 = x_{i_0}) = P(X_{n+1} = x_j \mid X_n = x_i, \ldots, X_{n-m+1} = x_{i_{n-m+1}})$   (5)

where the conditional probability of X_{n+1}, given all the previous events, equals the conditional probability of X_{n+1} given the m previous events, for an m-th order Markov model.
The transition probabilities are easily computed using the information residing in NG. We define the one-step transition probability matrix TP as follows: each item TP_{i,j} represents the probability of transitioning from page(s) x_i to page x_j in one step. In other words,

$TP_{i,j} = P(x_j \mid x_i) = \frac{w_{ij}}{w_i}$   (6)

where w_i represents the total number of visits to page(s) x_i, and w_{ij} represents the number of consecutive visits from x_i to x_j. Note that in the case of paths of length l > 1, we denote as x_i the prefix containing the first l-1 pages.
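
For illustration, here is a short Python sketch (the names are ours) that derives the path frequencies of Table 4 below and the one-step transition probabilities of Equation 6 directly from the user sessions:

    from collections import Counter

    def path_frequencies(sessions, max_len=3):
        """Count all sub-paths of length <= max_len (cf. Table 4)."""
        freq = Counter()
        for s in sessions:
            for l in range(1, max_len + 1):
                for i in range(len(s) - l + 1):
                    freq[tuple(s[i:i + l])] += 1
        return freq

    def transition_probability(freq, xi, xj):
        """TP[i,j] = w_ij / w_i (Equation 6), for single-page prefixes."""
        return freq[(xi, xj)] / freq[(xi,)]

    # e.g., with the sessions of Table 3:
    sessions = [list("abcd"), list("abed"), list("acdf"),
                list("bcbg"), list("bcfa")]
    freq = path_frequencies(sessions)
    # transition_probability(freq, "b", "c") yields w_bc / w_b = 3/5
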
Table 4. Path Frequencies

      l=1                l=2                  l=3
  xi     wi        xi→xj      wij       xi→xj→xk     wij
  a       4        a→b         2        a→b→c         1
  b       5        a→c         1        a→b→e         1
  c       4        b→c         3        a→c→d         1
  d       3        b→e         1        b→c→b         1
  e       1        b→g         1        b→c→d         1
  f       2        c→b         1        b→c→f         1
  g       1        c→d         2        b→e→d         1
                   c→f         1        c→b→g         1
                   d→f         1        c→d→f         1
                   e→d         1        c→f→a         1
                   f→a         1
Table 4 includes the paths of length l ≤ 3 corresponding to the user sessions of Table 3. Using this information, and based on the previous analysis, we can compute the transition probabilities for 1st- and 2nd-order Markov model NG synopses. The respective 1st-order Markov model (Markov Chain) synopsis is depicted in Figure 20. The numbers in parentheses in the nodes denote the number of visits to a page, whereas the edge weights denote the number of times the respective link was followed. Nodes S and E represent the paths' start and end points, respectively.

Figure 20. NG synopsis (Markov Chain)


In the analysis that follows we use Markov models in two different frameworks. In
the first, we apply them in order to synopsize NG prior to applying the proposed localized
personalized ranking algorithm l-UPR. In the second, we propose Markov model-based
hybrid predictive models that incorporate link analysis techniques.

4.3 Usage-based PageRank


So far, link analysis has largely been used in the context of web search. In this thesis, we introduce link analysis techniques into the web personalization process. We propose UPR, a hybrid PageRank-style algorithm that ranks the pages of a web site based on its link connectivity as well as its usage, in order to assist the recommendation process. In what follows, we present the original PageRank algorithm as proposed by Brin and Page [BP98]. We then provide the formal definition of the proposed algorithm, Usage-based PageRank (UPR).
4.3.1 PageRank

The PageRank algorithm is the most popular link analysis algorithm; it assigns numerical weights to web documents, which are used by web search engines in order to rank the retrieved results. The algorithm models the behavior of a random surfer, who either chooses an outgoing link from the page he is currently visiting, or jumps to a random page. Each choice bears a probability. The PageRank of a page is defined as the probability of the random surfer visiting this page at some particular time step k > K. This probability is correlated with the importance of the page, as it is defined based on the number and the importance of the pages linking to it. For sufficiently large K this probability is unique, as illustrated in what follows.
Consider the web as a directed graph G, where the N nodes represent the web pages and the edges represent the links between them. The random walk on G induces a Markov Chain where the states are given by the nodes in G, and M is the stochastic transition matrix, with m_{ij} describing the one-step transition probability from page x_j to page x_i. The adjacency function m_{ij} is 0 if there is no direct link from x_j to x_i, and is normalized such that, for each j:

$\sum_{i=1}^{N} m_{ij} = 1$   (7)

As stated by the Perron-Frobenius theorem, if M is irreducible (i.e., G is strongly connected) and aperiodic, then $M^k$ (i.e., the transition matrix for the k-step transition) converges to a matrix in which each column is the unique stationary distribution $\vec{PR}^*$, independent of the initial distribution $\vec{PR}$. The stationary distribution is the vector which satisfies the Equation:

$\vec{PR}^* = M \cdot \vec{PR}^*$   (8)

in other words, $\vec{PR}^*$ is the dominant eigenvector of the matrix M.
Since M is the stochastic transition matrix over the web graph G, PageRank is in essence the stationary probability distribution over pages induced by a random walk on G. As already implied, the convergence of PageRank is guaranteed only if M is irreducible and aperiodic [MR95]. The latter constraint is guaranteed in practice in the web context, since the visits to a web page do not usually follow a periodic pattern. The irreducibility is satisfied by adding a damping factor (1-ε) to the rank propagation (the damping factor is a very small number, usually set to 0.15), in order to limit the effect of rank sinks and guarantee convergence to a unique vector. We therefore define a new matrix M′ by adding low-probability transition edges between every pair of nodes in G:

$M' = \epsilon M + (1-\epsilon) U$   (9)

In other words, the user may follow an outgoing link, or choose a random destination (usually referred to as a random jump) based on the probability distribution of U. The latter process is also known as teleportation. PageRank can then be expressed as the unique solution to Equation 8, if we substitute M with M′:

$\vec{PR} = \epsilon M \vec{PR} + (1-\epsilon)\vec{p}$   (10)

where $\vec{p}$ is a non-negative N-vector whose elements sum to 1.


Usually $m_{ij} = \frac{1}{|Out(x_j)|}$, where $Out(x_j)$ is the set of pages pointed to by x_j, and $U = \left[\frac{1}{N}\right]_{N \times N}$, i.e., the probability of teleporting to another page is uniform. In that case, $\vec{p} = \left[\frac{1}{N}\right]_{N \times 1}$.

By choosing, however, U, and consequently $\vec{p}$, to follow a non-uniform distribution, we can bias the PageRank computation to favor certain pages (the random jump is therefore no longer random!). Thus, $\vec{p}$ is usually referred to as the personalization vector. This approach is largely used in the web search engine context, where the ranking of the retrieved results is biased by favoring pages relevant to the query terms, or to the user's preferences for certain topic categories [ANM04, H02, RD02, WC+02]. In what follows, we present UPR, a usage-based personalized version of the PageRank algorithm, used for ranking the pages of a web site based on the navigational behavior of previous visitors.
4.3.2 UPR: Link Analysis on the Navigational Graph

Based on the intuition that a page is important in a web site if many users have visited it before, we introduce the hybrid link analysis algorithm UPR. UPR extends the traditional link analysis algorithm PageRank by biasing the page ranking with knowledge acquired from previous user visits, as recorded in the user sessions. In order to achieve this, we define both the transition matrix M and the personalization vector $\vec{p}$ in such a way that the final ranking of the web site's pages is strongly related to the frequency of visits to them.

Recapitulating from Section 4.2, we define the directed navigational graph NG, where the nodes represent the web pages of the web site WS and the edges represent the consecutive one-step paths followed by previous users. Both nodes and edges carry weights. The weight w_i on each node represents the number of times page x_i was visited, and the weight w_{ji} on each edge represents the number of times x_i was visited immediately after x_j. We denote the set of pages pointed to by x_j (outlinks) as Out(x_j), and the set of pages pointing to x_j (inlinks) as In(x_j).
Following the aforementioned properties of Markov theory and the PageRank computation, the Usage-based PageRank vector $\vec{UPR}$ is the solution to the following Equation:

$\vec{UPR} = \epsilon M \vec{UPR} + (1-\epsilon)\vec{p}$   (11)

The transition matrix M on NG is defined as the square N × N matrix whose elements m_{ij} equal 0 if there does not exist a link (i.e., a recorded visit) from page x_j to x_i, and

$m_{ij} = \frac{w_{ji}}{\sum_{x_k \in Out(x_j)} w_{jk}}$   (12)

otherwise. The personalization vector $\vec{p}$ is defined element-wise as

$p_i = \frac{w_i}{\sum_{x_j \in WS} w_j}$   (13)

Using the aforementioned formulas, we bias the PageRank calculation to assign a higher rank to the pages that were visited more often by users in the past. We then use this hybrid ranking, combining the structure and the usage data of the site, to provide a ranked recommendation set to current users, as we describe in the subsequent Sections. Note that Equation 7 holds, that is, M is normalized such that the sum of each column equals 1; therefore M is a stochastic transition matrix, as required for the convergence condition of the algorithm to hold. M is, as already mentioned, aperiodic in the web context, and irreducible since we have included the damping factor (1-ε). It is therefore guaranteed that Equation 11 will converge to a unique vector, $\vec{UPR}^*$.
Definition (UPR): We define the usage-based PageRank UPR_i of a web page x_i as the n-th iteration of the following recursive formula:

$UPR_i^{(n)} = \epsilon \sum_{x_j \in In(x_i)} UPR_j^{(n-1)} \frac{w_{ji}}{\sum_{x_k \in Out(x_j)} w_{jk}} + (1-\epsilon) \frac{w_i}{\sum_{x_j \in WS} w_j}$   (14)


Each iteration of UPR has complexity O(n²). The total complexity is thus determined by the number of iterations, which in turn depends on the size of the dataset. In practice, however, PageRank (and accordingly UPR) gives good approximations after 50 iterations for ε=0.85 (which is the most commonly used value, recommended in [BP98]). The computations can be accelerated by applying techniques such as those described in [KHG03, KH+03], even though this is not necessary in the proposed frameworks, since UPR is applied to a single web site and therefore converges after a few iterations.
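
As an illustration of the computation, the following Python sketch (our own structuring, not the thesis prototype) runs the power iteration of Equation 14; it assumes that edge_w maps a directed edge (x_j, x_i) to its weight w_ji, and that node_w maps each page to its visit count w_i, with every page appearing in node_w:

    def upr(edge_w, node_w, eps=0.85, iterations=50):
        """Usage-based PageRank (Equation 14): eps weighs the rank
        propagation, (1 - eps) the usage-based personalization term."""
        nodes = list(node_w)
        total = float(sum(node_w.values()))
        out_sum = {x: 0.0 for x in nodes}   # sum of outgoing edge weights
        inlinks = {x: [] for x in nodes}    # incoming weighted edges
        for (xj, xi), wji in edge_w.items():
            out_sum[xj] += wji
            inlinks[xi].append((xj, wji))
        rank = {x: 1.0 / len(nodes) for x in nodes}
        for _ in range(iterations):
            new = {}
            for xi in nodes:
                prop = sum(rank[xj] * w / out_sum[xj] for xj, w in inlinks[xi])
                new[xi] = eps * prop + (1 - eps) * node_w[xi] / total
            rank = new
        return rank

Each iteration touches every recorded edge once, which is one way to see why convergence on a single web site's NG is fast in practice.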
In the Sections that follow, we present how UPR can be applied in different personalization frameworks in order to assist the recommendation process.

4.4 Localized UPR (l-UPR)


The UPR algorithm can be applied to a web site in order to rank its web pages, taking into consideration both its link structure and the paths followed by users, as recorded in the web logs. This process results in a global usage-based ranking of the web site's pages. In the context of web site personalization, however, we want to bias this algorithm further, focusing on the path the current visitor has followed and the most probable next pages he might visit, i.e., generating a localized personalized ranking.

We select a small subset of the NG synopsis with which we have modeled the user sessions, based on the current user's path. This sub-graph includes all the pages subsequently visited (relative to the current visit) by users with similar behavior in the past, up to a predefined path depth d. Therefore, it includes all the potential next pages of the current user's visit. l-UPR (localized UPR) is in essence the application of UPR to this small, personalized fraction of the navigational graph. The resulting ranking is used in order to provide recommendations to the current visitor. This approach is much faster than applying UPR to the whole NG synopsis, since the size of the graph is dramatically reduced, thus enabling online computation. Moreover, the ranking results are personalized for each individual user, since they are based on the current user's visit and similar users' past behavior. We present the process of creating the personalized sub-graph, termed prNG, and the recommendation process in more detail below.


4.4.1 The Personalized Navigational Graph (prNG)

In short, the process of constructing the personalized sub-graph is as follows: we expand (part of) the path already visited by the user, including all the outgoing links (i.e., the pages and the respective weighted edges) existing in the NG synopsis. The length of the path taken into consideration when expanding the graph depends on the NG synopsis used (in the case of Markov model synopses this represents the desired "memory" of the system). We subsequently perform this operation for the new pages (or paths), until we reach a predefined expansion depth. We then remove any pages that have already been visited by the user, since these don't need to be included in the generated recommendations. The children of a removed node (page) are linked to its parent. This ensures that all the pages previously visited by users with similar behavior will be kept in the final sub-graph, without including any higher-level pages they might have used as "hubs" for their navigation. After reaching the final set of nodes, we normalize each node's outgoing edge weights.
Before proceeding with the technical details of this algorithm, we illustrate its functionality using two examples, based on the sessions included in Table 3 and the respective path frequencies of Table 4. In both examples we create the prNGs for two user visits including the paths {a→b} and {b→c}. In the first example, we assume that the sessions are modeled using a Markov Chain NG synopsis. Using the path frequencies for l=2 (i.e., the one-step transitions), we expand the two paths, {a→b} and {b→c}, to create the respective prNGs, as shown in Figure 21. The second example is based on a 2nd-order Markov model NG synopsis; note that in this case we use the path frequencies for l=3. The corresponding prNGs for the two paths are illustrated in Figure 22. The outgoing edge weights of each node are normalized so that they sum to 1. We also observe that the nodes included in each prNG depend on the NG synopsis we choose to model the user sessions with.


Figure 21. prNG of Markov Chain NG synopsis

Figure 22. prNG of 2nd order Markov model NG synopsis

The prNG construction algorithm is presented in Figures 23 and 24. The algorithm's complexity depends on the synopsis used, since the choice of the synopsis affects the time needed for locating the successive pages when expanding the current path. It also depends on the number of outgoing links of each sub-graph page and the expansion depth d. Therefore, if the complexity of locating successive pages in a synopsis is k, the complexity of the prNG creation algorithm is $O(k \cdot fanout(NG)^{d-1})$, where fanout(NG) is the maximum number of outgoing links of a node in NG. In the case of Markov model synopses, k=1, since the process of locating the outgoing pages of a page or path reduces to a lookup in a hash table.


Procedure Create_prNG(CV, NG)
Input: Current User Visit CV, Navigational Graph NG
Output: Subset of NG, prNG

start
  CV = {v1, ..., vp};
  cp = lastVisitedPath(CV);
  expand(cp, NG, depth, expNG);
  removeVisited(expNG, CV);
  updateEdges(expNG);
  prNG = normalize(expNG);
end

Figure 23. Construction of prNG

Procedure expand(cp, NG, d, eNG)
Input: last page/path visited cp, navigational graph synopsis NG, depth of expansion d
Output: expanded navigational graph eNG

start
  P := {cp};
  R := rootNode(eNG);
  tempd := 0;
  addNode(eNG, R, cp);
  while (tempd <= d) do
    for every (p ∈ P of the same level) do
      for every np = linksto(NG, p, np, w) do
        addNode(eNG, p, np, w);
        P += np;
      done;
    done;
    tempd += 1;
  done;
end

Figure 24. Path expansion subroutine
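
A minimal Python sketch of the expansion step for a Markov Chain synopsis follows; out_edges is assumed to map each page to its weighted outgoing links, the removal of already-visited pages (removeVisited) is omitted for brevity, and all names are ours:

    def expand(cp, out_edges, d):
        """Breadth-first expansion of the last visited page cp up to
        depth d, normalizing each node's outgoing weights to sum to 1."""
        sub = {}
        frontier = {cp}
        for _ in range(d):
            nxt = set()
            for p in frontier:
                for np, w in out_edges.get(p, {}).items():
                    sub.setdefault(p, {})[np] = w
                    nxt.add(np)
            frontier = nxt
        return {p: {np: w / sum(ws.values()) for np, w in ws.items()}
                for p, ws in sub.items()}

With the l=2 weights of Table 4, expand("b", ..., 1) would, for instance, normalize b's outgoing weights to c: 3/5, e: 1/5, g: 1/5, which should match the prNG of Figure 21.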

Since the resulting prNG includes all possible next page visits of the user, we then apply UPR in order to rank them and generate personalized recommendations. The personalized navigational sub-graph prNG should be built so as to retain the attributes required for UPR to converge. The irreducibility of the sub-graph is always satisfied, since we have added the damping factor (1-ε) to the rank propagation. Moreover, Equation 7, which states that the sum of all outgoing edge weights of every node in the sub-graph equals 1, is satisfied since we normalize them. Note here that prNG does not include any previously visited pages.
Definition (l-UPR): We define l-UPR_i of a page x_i as the UPR rank value of this page in the personalized sub-graph prNG.

These l-UPR rankings of the candidate pages are subsequently used to generate a personalized recommendation set for each user. This process is explained in more detail in the following Section.
4.4.2 UPR-based Personalized Recommendations

The application of UPR or l-UPR to the navigational graph results in a ranked set of pages which are subsequently used for recommendations. As already presented, the final set of candidate recommendation pages can be either personalized or global, depending on the combination of algorithm and navigational graph chosen:
1) Apply l-UPR to prNG. Since prNG is a personalized fraction of the NG synopsis, this approach results in a personalized usage-based ranking of the pages most likely to be visited next, based on the current user's path.
2) Apply UPR to the NG synopsis. This approach results in a global usage-based ranking of all the web site's pages. This global ranking can be used as an alternative in case personalized ranking does not generate any recommendations. It can also be used for assigning page probabilities in the context of other probabilistic prediction frameworks, as we describe in the Section that follows.
Finally, another option would be to maintain a pre-computed set of recommendations for all popular paths in the web site, in order to save time during the online computation of the final recommendation set.

4.5 Web Path Prediction using hybrid Probabilistic Predictive Models


One of the most popular web usage mining methods is the use of probabilistic models. Such models represent the user sessions as a graph whose nodes are the web site's pages and whose edges are the hyperlinks between them; they are in essence based on what we have already described as NG synopses. Using the transition probabilities between pages as defined by the probabilistic model, a path prediction is made by selecting the most probable path among candidate paths, based on each user's visit. Such purely usage-based probabilistic models, however, present certain shortcomings. Since the prediction of users' navigational behavior is solely based on the usage data, the structural properties of the web graph are ignored; thus, important paths may be underrated. Moreover, as we will also see in the experimental study we performed, such models are often shown to be vulnerable to the training data set used.

In this Section we present a hybrid probabilistic predictive model (h-PPM) that extends Markov models by incorporating link analysis methods. More specifically, we choose Markov models as NG synopses and use UPR, along with two more PageRank-style variations of it, for assigning prior probabilities to the web pages based on their importance in the web site's web and navigational graphs.
4.5.1 Popular Path Prediction

As already presented, Markov models provide a simple way to capture sequential dependence when modeling the navigational behavior of the users of a web site. The order of the Markov model indicates the "memory" of the prediction, i.e., it denotes the number of previous user steps which are taken into consideration in the process of calculating the path probabilities. For example, in Markov Chains the probability of visiting a page depends only on the previous one, in 2nd-order Markov models it depends on the previous two, and so on. The selection of the order influences both the prediction accuracy and the complexity of the model, and heavily depends on the application and data set. After building the model, i.e., computing the transition probabilities, the path probabilities are computed using the chain rule. More specifically, for an m-th order Markov model, the probability of following the path x_1 → x_2 → ... → x_k equals:

$P(x_1 \to x_2 \to \ldots \to x_k) = P(x_1) \cdot \prod_{i=2}^{k} P(x_i \mid x_{i-m}, \ldots, x_{i-1})$   (15)


For example, using a Markov Chain as the prediction model, the probability of the path {a→b→c} reduces to

$P(a \to b \to c) = P(a) P(b \mid a) P(c \mid b) = P(a) \frac{P(a \to b)}{P(a)} \frac{P(b \to c)}{P(b)}$
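
A sketch of this chain-rule computation for a first-order model, assuming prior holds the prior page probabilities and tp is a nested dictionary of one-step transition probabilities (the names are ours):

    def path_probability(path, prior, tp):
        """P(x1 -> ... -> xk) = P(x1) * prod TP[x_{i-1}][x_i]
        (Equation 15 with m=1)."""
        p = prior[path[0]]
        for prev, cur in zip(path, path[1:]):
            p *= tp.get(prev, {}).get(cur, 0.0)  # 0 for unseen transitions
        return p
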

Based on Equation 15, the prediction of the next most probable page visit of a user is performed by computing the probabilities of all existing paths having the pages visited so far by the user as a prefix, and choosing the most probable one. The computation of the conditional probabilities is straightforward, since it reduces to a lookup in the transition probability matrix TP. On the other hand, the assignment of prior probabilities is an open issue, and we deal with it in the sequel.
4.5.2 Reconsidering Prior Probabilities Computation

There are three approaches commonly used for assigning initial probabilities (priors) to the nodes of a Markov model. The first one assigns equal probabilities to all nodes (pages). The second estimates the initial probability of a page p as the ratio of the number of visits to p as the first page in a path, to the total number of user sessions. In the case of modeling web navigational behavior, however, neither of the aforementioned approaches provides accurate results. The first approach assumes a uniform distribution, favoring non-important web pages. The second does exactly the opposite: it favors only top-level "entry" pages; furthermore, the prior probability of a page that was never visited first equals zero. The third approach is more objective with regard to the other two, since it assigns prior probabilities proportionally to the frequency of total visits to a page. This approach, however, does not handle important, yet new (i.e., not included in the web usage logs) pages. Finally, as shown in the experimental evaluation, all approaches are very vulnerable to the training data used for building the predictive model.
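
For reference, the two non-uniform approaches reduce to simple frequency counts; a minimal Python sketch (the names are ours):

    from collections import Counter

    def start_total_priors(sessions):
        """'Start': fraction of sessions beginning at a page;
        'Total': fraction of all page visits falling on a page."""
        start, total = Counter(), Counter()
        for s in sessions:
            start[s[0]] += 1
            total.update(s)
        n_visits = sum(total.values())
        return ({p: c / len(sessions) for p, c in start.items()},
                {p: c / n_visits for p, c in total.items()})

Note that a page never visited first simply receives no Start entry, i.e., a zero prior, which is exactly the shortcoming discussed above.
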
In the literature, only a few approaches exist whose authors argue that these techniques are not accurate enough and define different priors. Sen and Hansen [SH03] use Dirichlet priors, whereas Borges and Levene [BL04] define a hybrid formula which combines the two options (taking into consideration the frequency of visits to a page as the first page, or the total number of visits to the page). For this purpose, they define a variable α, which ranges from 0 (for page requests as first page) to 1 (for total page requests). In their experimental study, however, they don't explicitly state the optimal value they used for α.
In this thesis, we address these shortcomings following an alternative approach. Our motivation draws from the fact that the initial probability of a page should reflect the importance of this page in the web navigation. We propose the integration of the web site's topological characteristics, as represented by its link structure, with the navigational patterns of its visitors, for computing these probabilities. More specifically, we propose the use of three PageRank-style ranking algorithms for assigning prior probabilities. The first (PR) is the PageRank algorithm applied to the web site's graph, and computes the page prior probabilities based solely on the link structure of the web site. The second is UPR, which, as already described, is applied to the web site's navigational graph and favors pages previously visited by many users. The third algorithm (SUPR) is a variation of UPR which assigns uniform probabilities to the random jump instead of biasing it as well.
Definition (PageRank-based Prior Probability): We define the prior probability P(x_i) of a page x_i as:

$P(x_i) = P^{(n)}(x_i) = (1-\epsilon) \cdot p(x_i) + \epsilon \sum_{x_k \in In(x_i)} P^{(n-1)}(x_k) \cdot p(x_k, x_i)$   (16)

with (1-ε) being the damping factor (usually set to 0.15), and for

(i) PR (PageRank):

$p(x_i) = \frac{1}{N}$ and $p(x_k, x_i) = \frac{1}{|Out(x_k)|}$   (17)

(ii) SUPR (Semi-Usage PageRank):

$p(x_i) = \frac{1}{N}$ and $p(x_k, x_i) = \frac{w_{ki}}{\sum_{x_j \in Out(x_k)} w_{kj}}$   (18)

(iii) UPR (Usage PageRank):

$p(x_i) = \frac{w_i}{\sum_{x_j \in WS} w_j}$ and $p(x_k, x_i) = \frac{w_{ki}}{\sum_{x_j \in Out(x_k)} w_{kj}}$   (19)

Any of the aforementioned ranking schemes can be applied to the web site's web or navigational graph (or its synopsis), resulting in a probability assignment for each one of its pages. These probabilities can subsequently be used in the place of the commonly used priors, addressing the aforementioned problems. As we show in the experimental study we performed, this approach provides more objective and precise predictions than the ones generated by the pure usage-based approaches. A short sketch of how the three variants relate follows.
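
Under the assumptions of the upr() sketch of Section 4.3.2, a hypothetical wrapper illustrating that the three variants differ only in their inputs (for PR, the edges should come from the web site's hyperlink graph rather than from the usage data):

    def prior_probabilities(edge_w, node_w, mode, eps=0.85, iterations=50):
        """Equation 16 priors; mode is one of 'PR', 'SUPR', 'UPR'."""
        if mode == "PR":
            edge_w = {e: 1.0 for e in edge_w}   # uniform transitions over outlinks
        if mode in ("PR", "SUPR"):
            node_w = {x: 1.0 for x in node_w}   # uniform random jump
        return upr(edge_w, node_w, eps, iterations)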

4.6 Experimental Evaluation


In this Section we present a set of experiments we performed in order to evaluate the performance of both recommendation frameworks proposed in this thesis. In the case of l-UPR, since there is no previous related work to compare it with, we use two different setups of Markov Chains, the NG synopsis we also used in the l-UPR setup. Using all three setups, we generate top-3 and top-5 recommendation sets for 10 different user paths, and compare them to the actual paths the users followed. In order to evaluate the incorporation of page ranking in the hybrid probabilistic predictive models (h-PPM), we compare the top-n path rankings generated by five different setups with the n most frequent paths. For our experiments, we use two different data sets in order to examine how the proposed methods behave in various types of web sites.
4.6.1 Experimental Setup

In our experiments we used two publicly available data sets. The first one includes the page visits of users who visited the msnbc.com web site on 28/9/99 [MSN]. The visits are recorded at the level of URL category (for example, "sports", "news", etc.); it includes visits to 17 categories (i.e., 17 distinct pageviews). We selected 96.000 distinct sessions including more than one and fewer than 50 page visits per session, and split them in two non-overlapping time windows to form a training (65.000 sessions) and a test (31.000 sessions) data set. The second data set includes the sessionized data of the DePaul University CTI web server, based on a random sample of users visiting the site for a two-week period during April 2002 [CTI]. The data set includes 683 distinct pageviews and 13.745 distinct user sessions of length more than one. We split the sessions in two non-overlapping time windows to form a training (9.745 sessions) and a test (4.000 sessions) data set. We will refer to these data sets as the msnbc and cti data sets respectively. We chose these two data sets since they present different characteristics in terms of web site context and number of pageviews. (We should note at this point that there does not exist any benchmark for web usage mining and personalization; we therefore chose these two publicly available data sets, which have been used in the past for experimentation in the web usage mining and personalization context.) More specifically, msnbc includes the visits to a very big portal. This means that the number of sessions, as well as the length of paths, is very large. This data set, however, has very few distinct pageviews, since the visits are recorded at the level of page categories. We expect the visits to this web site to be almost homogeneously distributed among the 17 different categories. On the other hand, the cti data set refers to an academic web site. Visits to such sites are usually categorized in two main groups: visits from students looking for information concerning courses or administrative material, and visits from researchers seeking information on papers, research projects, etc. We expect the recorded visits to reflect this categorization.
Since in all the experiments we created top-n rankings, in the evaluation step we used two metrics commonly employed for comparing two top-n rankings r1 and r2. The first one, denoted as OSim(r1, r2) [H02], indicates the degree of overlap between the top-n elements of two sets A and B (each one of size n):

$OSim(r_1, r_2) = \frac{|A \cap B|}{n}$   (20)

The second, KSim(r1, r2), is based on Kendall's distance measure [KG90] and indicates the degree to which the relative orderings of two top-n lists are in agreement. It is defined as:

$KSim(r_1, r_2) = \frac{|\{(u, v) : r_1', r_2' \text{ have the same ordering of } (u, v), u \neq v\}|}{|A \cup B| \cdot (|A \cup B| - 1)}$   (21)

where r_1' is an extension of r_1, containing all elements included in r_2 but not in r_1 at the end of the list (r_2' is defined analogously) [H02]. In other words, KSim takes into consideration only the common items of the two lists, and computes how many pairs of them have the same relative ordering in both lists. OSim is the more important of the two (especially for small rankings), since it indicates the concurrence of the predicted pages with the actually visited ones. KSim, on the other hand, must always be evaluated in conjunction with the respective OSim, since it can take high values even when only a few items are common to the two lists.
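
A small Python sketch of the two metrics, assuming the rankings are given as equal-length lists of page identifiers (the names are ours):

    def osim(r1, r2):
        """Overlap of two top-n lists (Equation 20)."""
        return len(set(r1) & set(r2)) / len(r1)

    def ksim(r1, r2):
        """Kendall-style agreement of relative orderings (Equation 21)."""
        r1e = list(r1) + [x for x in r2 if x not in r1]  # extension r1'
        r2e = list(r2) + [x for x in r1 if x not in r2]  # extension r2'
        pos1 = {x: i for i, x in enumerate(r1e)}
        pos2 = {x: i for i, x in enumerate(r2e)}
        union = set(r1) | set(r2)
        pairs = [(u, v) for u in union for v in union if u != v]
        agree = sum((pos1[u] < pos1[v]) == (pos2[u] < pos2[v])
                    for u, v in pairs)
        return agree / len(pairs)
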
At this point, we should discuss the methodology we chose for evaluating the generated recommendations. Several related research efforts propose a general personalization architecture without supporting their work with any experimental evaluation [DM02, GKG05, ML+04, MPT99, OB+03, ZHH02b]. In this work we followed a commonly used methodology, dividing the data set into training and test data. According to this evaluation methodology, the training data are used in order to generate the predictive model. The generated recommendations are in turn compared to the actual user paths, as derived from the test data, using various metrics [AG03, HLC+05, JPT03, JZM04b, JZM05, MD+00b, MPG03, NP03, SK+01]. Since, however, the recommendations are compared to paths that have already been followed by the users, it is questionable whether such a comparison evaluates the quality of recommendations that include new paths. This issue is partially addressed by most predictive models, since the generated recommendations include pages that are two or more steps away. In real-life systems, this problem is addressed when the predictive model is based on data extracted from an already personalized web site.
4.6.2 l-UPR Recommendations Evaluation

As already mentioned, the choice of the NG synopsis used to model the user sessions is orthogonal to the l-UPR framework. In this Section, we present results regarding the impact of using our proposed method instead of pure usage-based probabilistic models, focusing on Markov Chains.

We used 3 different setups for generating recommendations. The first two, referred to as Start and Total, are the ones commonly used in Markov models for computing prior probabilities. More specifically, Total assigns prior page probabilities proportional to the total page visits, whereas Start assigns prior page probabilities proportional to the visits beginning with this page. The third setup, referred to as l-Upr, is in essence our proposed algorithm applied to a Markov Chain-based prNG. For the l-Upr setup, we set the damping factor (1-ε) to 0.15 and the number of iterations to 100 to ensure convergence. We expand each path to depth d=2.
The experimental scenario is as follows: we select the 10 most popular paths comprising two or more pages from the test data set. For each such path p, we make the assumption that it is the current path of the user, and generate recommendations by applying the aforementioned approaches to the training data set. Using the first two setups, we find the n pages having the highest probability of being visited after p. Using our approach, on the other hand, we expand p to create a localized sub-graph and then apply l-UPR to rank the pages included in it; we then select the top-n ranked pages. This process results in three recommendation sets for each path p. At the same time, we identify, in the test data set, the n most frequent paths that extend p by one more page. We finally compare, for each path p, the generated top-n page recommendations of each method (Start, Total, l-Upr) with the n most frequent next pages, using the OSim and KSim metrics.

We ran the experiments generating top-3 and top-5 recommendation lists for each setup. We performed the experiments using small recommendation sets because this more closely resembles what happens in reality, i.e., the system recommends only a few next pages to the user. The diagrams presented here show the average OSim and KSim similarities over all 10 paths.
Figure 25 depicts the average OSim and KSim values for the top-3 and top-5 rankings generated for the msnbc data set. In the first case (top-3 page predictions) we observe that l-Upr behaves slightly worse in terms of prediction accuracy (OSim), but all methods achieve around 50% accuracy. The opposite is observed in the second case (top-5 page predictions), where l-Upr achieves better prediction accuracy than the other two methods, and the overall prediction accuracy is above average. In both cases we observe a lower KSim, leading to the conclusion that l-Upr managed to predict the next pages, but not in the same order as they were actually visited. As we mentioned earlier, however, the presentation order is not so important in such a small recommendation list. Overall, the differences between the three methods are insignificant. This can be justified if we take into account the nature of the data set used: as already mentioned, the number of distinct pageviews of the data set is very small, and therefore the probability of coinciding in the predictions is the same, irrespective of the method used.
[Bar charts: average OSim and KSim for top-3 (left) and top-5 (right) recommendations on the msnbc data set; series: Start, Total, l-UPR.]

Figure 25. Average OSim and KSim of top-n rankings for msnbc data set

In order to determine whether the number of distinct pageviews is what affects the prediction accuracy of the three methods, we performed the same experimental evaluation on the second data set, cti. Figure 26 depicts the average OSim and KSim values for the top-3 and top-5 rankings generated for the cti data set. We observe that in both cases l-Upr outperforms the other two methods, both in terms of prediction accuracy (OSim) and in relative ordering (KSim). This finding supports our intuition that, in the case of big web sites with many pageviews, the incorporation of structure data in the prediction process enhances the accuracy of the recommendations.
[Bar charts: average OSim and KSim for top-3 (left) and top-5 (right) recommendations on the cti data set; series: Start, Total, l-UPR.]

Figure 26. Average OSim and KSim of top-n rankings for cti data set

Examining all findings together, we verify our claim that l-UPR performs the same as, or better than, commonly used probabilistic prediction methods. Even though the prediction accuracy in both experiments is around 50%, we should point out that this value represents the average OSim over 10 distinct top-n rankings. Examining the rankings individually, we observed a large variance in the findings, with some recommendation sets being very similar to the actually visited pages (OSim > 70%), whereas others were very dissimilar (OSim < 20%). Moreover, the NG synopsis used in all three setups is the Markov Chain, which is the simplest synopsis model, yet the least accurate one. We expect better prediction accuracy if the algorithm is applied over a more accurate NG synopsis, and leave this open for future work.

Overall, taking into consideration the low complexity of the proposed algorithm, which enables the fast, online generation of personalized recommendations, we conclude that it is a very efficient alternative to pure usage-based methods.
4.6.3 h-PPM Recommendations Evaluation

In order to evaluate the impact of incorporating link analysis methods in the probabilistic prediction process, we used 5 setups of the prediction model, differing in terms of the prior probabilities computation. The first two setups, termed Start and Total, are the ones used in previous approaches for computing prior probabilities, as explained in the previous Section. More specifically, Start assigns probabilities proportional to the visits to a page at the beginning of the sessions, whereas Total assigns probabilities proportional to the total visits to a page. We do not include the approach of assigning uniform prior probabilities to all nodes, since it is shown to perform worse than the other two. The remaining three setups, termed PR, SUPR, and UPR, assign probabilities using the respective proposed algorithms defined in Section 4.5.2. We use two NG synopses for approximating the Navigational Graph NG, namely, the Markov Chain and the 2nd-order Markov model. For the PageRank-style algorithms, the damping factor (1-ε) was set to 0.15 and the number of iterations was set to 100.

Applying the five setups to the training data, we generated a list including the top-n most probable paths for n ∈ {3, 5, 10, 20}. We then compared these results with the top-n most frequent paths (i.e., the actual paths followed by the users), as derived from the test data.
The diagrams of Figures 27 and 28 depict the OSim and KSim similarities for the top-3, 5, 10, and 20 rankings of the msnbc data set, using a Markov Chain as NG synopsis and prediction model. We observe that OSim is around 60% for the two pure usage-based methods, Start and Total, whereas it is more than 80% for the three proposed methods. KSim, on the other hand, exceeds 90% for all rankings in the case of our proposed methods, whereas it is high only for the first three rankings of the Start setup.

[Bar chart: OSim for the top-3, 5, 10, and 20 rankings of the msnbc data set, Markov Chain synopsis; series: Start, Total, PR, SUPR, UPR.]

Figure 27. OSim for msnbc data set, Markov Chain NG synopsis
[Bar chart: KSim for the top-3, 5, 10, and 20 rankings of the msnbc data set, Markov Chain synopsis; series: Start, Total, PR, SUPR, UPR.]

Figure 28. KSim for msnbc data set, Markov Chain NG synopsis

The diagrams of Figures 29 and 30 depict the OSim and KSim similarities for the top-3, 5, 10, and 20 rankings of the cti data set. In this case, the rankings produced by the two common methods did not match the actual visits at all, giving 0% OSim and KSim similarity! On the other hand, all three proposed methods reached an average of 80% OSim and 90% KSim in all setups, with SUPR slightly outperforming PR and UPR.


[Bar chart: OSim for the top-3, 5, 10, and 20 rankings of the cti data set, Markov Chain synopsis; series: Start, Total, PR, SUPR, UPR.]

Figure 29. OSim for cti data set, Markov Chain NG synopsis
[Bar chart: KSim for the top-3, 5, 10, and 20 rankings of the cti data set, Markov Chain synopsis; series: Start, Total, PR, SUPR, UPR.]

Figure 30. KSim for cti data set, Markov Chain NG synopsis

At this point, we should analyze the behavior of the Start and Total setups, which represent the straightforward Markov model implementation. The outcomes of the experiments verify our claim that Markov models are very vulnerable to the training data used, and several pages may be overrated or underestimated in certain circumstances. In the case of the msnbc data set, where the number of distinct pages was very small and therefore the navigational paths were evenly distributed, the pure usage-based models seem to behave fairly well (but, again, worse than the hybrid models). On the other hand, in the case of the cti data set, where hundreds of distinct pages (and therefore distinct paths) existed, the prediction accuracy of the usage-based models was disappointing! We examined the produced top-n rankings of the two usage-based approaches, and observed that they include only the visits of students to course material. Since many students probably visited the same pages and paths in that period of time, accessing the pages directly (probably using a bookmark), these visits overshadowed every other path visited by any other user. On the other hand, by taking into consideration the "objective" importance of a page, as conveyed by the link structure of the web site, such temporal influences are reduced. The reader may refer to Appendix B, which includes the top-10 ranked paths generated using the Start and Total setups, as well as the 10 most frequent ones, which were used as the test data in our experiments.
The framework proposed in this Chapter can be directly applied for computing the prior probabilities of visiting the pages of a web site; in other words, it can be directly applied to Markov Chain NG synopses. In the case of higher-order Markov models, however, our intuition was that this framework should be extended to support the computation of prior probabilities for path visits (up to some length, depending on the order). For instance, a 2nd-order Markov model is based on the assumption that we have prior knowledge concerning the visit probabilities of all paths including up to 3 pages. Indeed, the results from applying the proposed algorithms to the cti data set indicated the need for this model extension. In the case of the msnbc data set, however, we did not observe any significant deviation in the results. This can be explained by the fact that msnbc has only a few distinct nodes, hence a small number of distinct paths a user can follow. As already mentioned, in this data set the users' visits were almost uniformly distributed across all of the web site's page categories; therefore the probability of visiting two pages consecutively is very well approximated by the probability of visiting the last page (almost independently of the page the user was previously visiting). In what follows, we present the results of this experiment. We present some preliminary ideas concerning possible extensions of this framework in the final Chapter.
The results of the experiments we performed using the 2nd-order Markov model as NG synopsis on the msnbc data set are included in the diagrams of Figures 31 and 32. We observe that in the case of 2nd-order Markov models the winner is UPR, followed by the SUPR and Total setups. A very interesting fact is that the pure link-based approach, PR, gives the worst results, having 0% OSim for the top-3 and top-5 rankings and only 20% OSim for the top-10 ranking. This can be explained by the fact that PR, which is in essence the application of the PageRank algorithm to the web site's graph, represents the steady-state vector of the Markov Chain as defined on the web graph; therefore, when the web graph is modeled using an NG synopsis other than the Markov Chain, it is not as effective. On the other hand, the hybrid usage/link ranking algorithms outperform the two commonly used usage-based approaches in most cases.
[Bar chart: OSim for the top-3, 5, 10, and 20 rankings of the msnbc data set, 2nd-order Markov model synopsis; series: Start, Total, PR, SUPR, UPR.]

Figure 31. OSim for msnbc data set, 2nd-order Markov model NG Synopsis
[Bar chart: KSim for the top-3, 5, 10, and 20 rankings of the msnbc data set, 2nd-order Markov model synopsis; series: Start, Total, PR, SUPR, UPR.]

Figure 32. KSim for msnbc data set, 2nd-order Markov model NG Synopsis

Overall, comparing the three proposed methods, we observe that, for the msnbc data set, all methods have the same OSim when a Markov Chain synopsis is used, whereas UPR outperforms the other two when a 2nd-order Markov model synopsis is used. On the other hand, in the case of the cti data set, we observe that SUPR outperforms the other two methods. Nevertheless, there is no prevalent underlying pattern relating the number of recommendations to OSim/KSim. Therefore, we cannot conclude on the superiority of any one of the proposed methods, other than that it strongly depends both on the data set and on the NG synopsis used.
4.6.4 Comparison of l-UPR and h-PPM

In the last part of the experimental evaluation, we compared the two proposed frameworks, namely, l-UPR and h-PPM. For this purpose, we used the same methodology we followed when evaluating the l-UPR framework, as described in Section 4.6.2: we build the navigational graph from the test data set and select the 10 most popular paths comprising two or more pages. For each such path p, we make the assumption that it is the current path of the user, and generate recommendations by applying the aforementioned approaches to the training data set. Using l-UPR, we expand p to create a localized sub-graph and then apply the algorithm to rank the pages included in the sub-graph. Using h-PPM, we find the n pages having the highest probability of being visited after p. We then select the top-n ranked pages. At the same time, we identify in the test data set the n most frequent paths that extend p by one more page. We finally compare, for each path p, the generated top-n page recommendations of each method (l-UPR, h-PPM) with the n most frequent next pages, using the OSim metric. We omit the KSim results here since, as already mentioned, they are not very important for such small recommendation sets.

We applied this methodology to both data sets, generating top-3 and top-5 recommendation sets. For generating recommendations using the h-PPM framework, we present here the variation that behaved best in the previous experiments, namely, UPR for the msnbc data set and SUPR for the cti data set. We should point out, however, that all variations produce almost the same recommendations for such small sets, as already implied in Section 4.6.3. The experimental results are depicted in Figure 33.


[Bar charts: OSim of l-UPR and h-PPM for top-3 and top-5 recommendation sets on the cti (left) and msnbc (right) data sets.]

Figure 33. Comparison of l-UPR and h-PPM, Markov Chain NG synopsis

We observe that the relative prediction accuracy of each method depends on the size of the recommendation set and on the data set used. h-PPM has better prediction accuracy for small recommendation sets, whereas l-UPR is slightly better for larger recommendation sets. Nevertheless, the differences between the two methods are minor, and we cannot draw any conclusions other than repeating that the final choice heavily depends on the data set we want to model.

4.7 System Prototype


In this Section we present the prototype system implementing the aforementioned approaches, namely, the l-UPR and h-PPM recommendation frameworks. Both frameworks are integrated in the same prototype, as they share the same infrastructure and some algorithms. Apart from the proposed frameworks, this prototype system implements Markov models of any order, supporting path prediction and recommendations.

The system prototype is entirely based on C# (Microsoft Visual Studio .NET). No database was used, since all the inputs/outputs of the system are files; anything created on the fly was stored in hash tables. The main functionalities of the prototype are described below. The names in the parentheses next to the module names indicate the framework(s) each module is used in.
Web Graph Reconstruction (h-PPM & l-UPR): Since this framework implements link analysis-based algorithms (PageRank (PR) and Semi-Usage PageRank (SUPR)), knowledge of the web site's link structure is essential. The web structure, however, is not always known. The system provides the web graph reconstruction module, which takes as input the web site's user sessions and reconstructs the web site's graph based on this information. The output of this process is an XML file.
Navigational Graph Creation (h-PPM): The proposed algorithm, UPR, is based on the application of link analysis over the web site's Navigational Graph, NG. This module enables the creation of NG from the web site's user sessions, to be used in subsequent computations. The NG is stored in a hash file.

Prior Probabilities Computation (h-PPM): This module enables the computation of the prior probabilities defined in Section 4.5.2, namely, PR, SUPR and UPR. The system also provides the functionality for computing probabilities based on the page visit frequencies used by Markov models, namely, Start and Total (prior probabilities proportional to the number of visits to a page as the first page in the session, or to the total number of visits to a page, respectively). This process takes as input the parameters of the chosen probability computation method. The prior probabilities computed using any one of the five alternative methods are stored in a hash table and used by the Path Probabilities Computation module. The results, as well as a log including all the iterations of the link analysis-based algorithms, are also saved in files. A screenshot of this module is shown in Figure 34.
Path Probabilities Computation (h-PPM): The system enables popular path prediction using either the h-PPM framework or Markov models. The priors used in each method are pre-computed by the Prior Probabilities Computation module. This module enables the prediction of the n most probable next visits for any sub-path of NG, along with the respective probabilities. It also enables the prediction of the top-n popular paths. This information is output to files for further analysis. Figure 35 includes a screenshot of this module.

l-UPR Path Prediction (l-UPR): This module implements the l-UPR recommendation framework. It incorporates functionality for creating the NG using the web site's user sessions, whereas the NG synopsis can be a Markov model of any order. The module takes as input the path (the current visitor's path) and the parameters of the l-UPR algorithm, and outputs the recommended paths in a file. A log file of the l-UPR iterations is also created. A screenshot of this module is included in Figure 36.
OSim-KSim (h-PPM & l-UPR): This module implements the two similarity measures
employed in our experimental study. This process takes as input two top-n lists and
outputs the respective OSim and KSim similarities.

Figure 34. The Prior Probabilities Computation module


Figure 35. The Path Probabilities Computation module

Figure 36. The l-UPR Path Prediction module


4.8 Conclusions
There is a wealth of recommendation models for personalizing a web site based on the navigational behavior of past users. Most of these models, however, are solely based on the web site's usage data, ignoring the link structure of the web graph visited. In this Chapter we presented how link analysis can be integrated in the web personalization process. We proposed a novel algorithm, UPR, which is applicable to any navigational graph synopsis and provides ranked recommendations to the visitors of a web site, capitalizing on the structural properties of the navigational graph. We presented UPR in the context of two different personalization frameworks, l-UPR and h-PPM. In the first framework, a localized version of UPR is applied to a personalized sub-graph of the NG synopsis and is used to create online personalized recommendations for the visitors of the web site. The second approach addresses several shortcomings of pure usage-based probabilistic predictive models by incorporating link analysis techniques in such models in order to support popular path prediction. The experimental results for both frameworks are more than promising, outperforming existing approaches.


CHAPTER 5
Conclusions and Future Research
5.1 Thesis Summary
The World Wide Web grows at a tremendous pace, and its impact as the main source of information acquisition is increasing dramatically. Because of its rapid and chaotic growth, the resulting network of information lacks organization and structure, making web site exploration difficult. To address the requirement of effective web navigation, web sites provide personalized recommendations to their end users.

Most of the research efforts in web personalization correspond to the evolution of extensive research in web usage mining, i.e., the exploitation of the navigational patterns of a web site's visitors. When a personalization system relies only on usage-based results, however, valuable information conceptually related to what is finally recommended may be missed. Moreover, the structural properties of the web site are often disregarded. In this thesis, we present novel techniques that incorporate the content semantics and the structural properties of a web site in the web personalization process.
In the first part of our work we present a semantic web personalization system. Motivated by the fact that a personalization system based only on the recorded navigational patterns may miss important information that is semantically similar to what is recommended, we propose a web personalization framework (SEWeP) that integrates usage data with content semantics, expressed in ontology terms, in order to compute semantically enhanced navigational patterns and effectively generate useful recommendations.

To support this framework, we developed various techniques for managing web content. The web document annotation is performed using both the content of the page and its connectivity features, thus providing a more objective characterization of the text. Moreover, the mapping of the extracted features to ontology terms is performed automatically, as opposed to related research efforts which require manual labeling of the documents. This uniform representation of the content using ontology terms enables further processing (clustering, association rules mining, recommendation generation) at an abstract, semantic level. Therefore, the resulting recommendations are semantically enhanced and evaluated to be more useful during the users' navigation. To the best of our knowledge, SEWeP is the only semantic web personalization system that performs automated content characterization and ontology mapping, and it can therefore be used by any web site.
In the second part of our work, we propose the incorporation of the underlying structure of a web site in the web personalization process. We present the novel algorithm UPR (Usage-based PageRank), a PageRank-style algorithm that relies on the recorded usage data and link analysis techniques in order to determine the importance of a web page. We demonstrate how UPR can be applied to any web site in order to rank its web pages. We then specialize the proposed personalization framework, comprising the UPR algorithm as well as an abstraction of the user sessions termed the Navigational Graph, in two different contexts.

We introduce l-UPR, a personalized recommendation algorithm. l-UPR is a localized variant of UPR that is applied to the personalized navigational sub-graph of each user. We present the algorithm for creating personalized navigational sub-graphs using the current visit of each user. This sub-graph includes all the possible next visits of the user, maintaining the properties required for the algorithm to converge. Since l-UPR is applied only to a small fraction of the actual web graph, its computation is not time-consuming; it can therefore be applied online, providing fast and accurate personalized recommendations to each web site's visitors.


Based on the same motivation and framework, we propose a hybrid probabilistic predictive model, in order to address the shortcomings of pure usage-based probabilistic models. In this model, we use UPR and its variations as a robust mechanism for determining the prior probabilities of page visits. Using this approach, the probabilistic predictive models are enriched with information that is independent of the usage data, resulting in more objective and representative predictions than existing techniques that rely solely on usage data.

The diversity of the two specializations verifies the potential of our approach in providing an integrated framework for applications of link analysis to web personalization.

5.2 Discussion
SEWeP was introduced at a time when the Semantic Web vision was rather new, and
people were just starting to exploit some of its principal ideas, structures, languages and
protocols. Since then, some of these have been shown to be insufficient and were abandoned,
whereas others, ontologies among them, are now broadly accepted and used in
many different applications. SEWeP exploits ontologies in order to represent web content
and the users' navigational behavior. The exploitation of content semantics (either using
ontologies or not) in the web usage mining or the web personalization process has been
the subject of many studies that followed (or were carried out in parallel with) SEWeP
[AG03, OB+03, ML+04, MSR04, GKG05].
There exist, however, several open issues in this area. Since most existing web
sites do not have an underlying semantic infrastructure, and due to the size of most web
sites, it is very difficult to annotate the content by hand, as most approaches assume. It is
evident that the content characterization process should be performed automatically. One
of the most crucial parts of this process is the mapping of the extracted features to
ontology terms, since an inappropriate mapping would eventually result in inaccurate
recommendations. Therefore, one of the most important updates to the SEWeP system
would be to incorporate even more effective semantic similarity measures, such as those
proposed in [MTV04, MT+05].

Another important issue concerning the content characterization process is the
processing of multilingual content. Even though all systems are based on the implicit
assumption that the web content is annotated in the same language (usually English), this
is not the case for the majority of web pages. In this thesis we proposed a preliminary
solution for bilingual web sites, focusing on Greek web sites. A possible extension would
be to generalize this framework for multilingual sites, enhancing the translation process
with more effective similarity measures, such as those proposed above.
The semantic web personalization framework presented in this thesis is solely based
on the recorded usage data and the content of the web site. Since knowledge about the
users' profiles, as used in collaborative filtering systems, has been shown to be
valuable in the recommendation process, we would like to extend the proposed
architecture to incorporate such information regarding users' ratings, preferences, etc. If,
for example, the users' preferences are expressed in ontology terms and are available
a priori, the recommendation algorithm could be tuned to promote relevant pages.
Moreover, we should consider the hidden (or deep, or invisible) web, in other words,
the pages that are dynamically created as results of queries on underlying databases. In
the last few years, the volume of hidden web pages has been rapidly increasing. Such content cannot,
however, be indexed and processed easily. This is a major problem in the web search
context nowadays, and should be taken into consideration by future personalization
frameworks.
In general, we believe that the incorporation of content semantics in the web
personalization process has been shown to be very useful and this is verified by the fact
that it is nowadays commonly used in real-life recommendation systems. The same holds
for the combination of more than one web usage mining technique in the web
personalization process.
As far as the second part of this thesis is concerned, we see much potential in the
integration of link analysis into web personalization. Even though link analysis algorithms
have been used extensively in other contexts in the past few years, especially in web searching,
only lately have they been introduced in similar [BG+04] or the same [BL06]
application, that of web usage mining and personalization.

In this thesis we have supported, through an extensive experimental evaluation, the
initial intuition that link analysis can be used in several different contexts to support
web personalization. As we have already pointed out, the priors defined in the h-PPM
framework are directly applicable to Markov Chains, but do not always work for
higher-order Markov models. UPR and its variations compute probabilities for the
navigational graph's nodes, i.e., the web site's pages. In higher-order Markov models, we
need such probabilities for the web site's paths too. One solution would be to create
summarizing nodes that represent all the paths, or only the most popular ones, and then apply
UPR to this aggregate navigational graph. This would result in UPR values for paths
which could subsequently be used in the h-PPM context. This issue remains open for
future work.
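
One conceivable construction, sketched below under our own assumptions (the precise
design is left open above), turns frequent fixed-length paths into summarizing nodes and
links path (a, b) to path (b, c) whenever the longer path (a, b, c) is observed in a
session; UPR could then be run directly on the resulting graph to obtain path scores.

from collections import Counter

def aggregate_path_graph(sessions, order=2, min_support=2):
    """Build a graph whose nodes are frequent length-`order` paths and whose
    weighted edges count how often one path extends another in a session."""
    counts = Counter(tuple(s[i:i + order])
                     for s in sessions for i in range(len(s) - order + 1))
    nodes = {p for p, n in counts.items() if n >= min_support}
    edges = {p: {} for p in nodes}
    for s in sessions:
        for i in range(len(s) - order):
            p, q = tuple(s[i:i + order]), tuple(s[i + 1:i + order + 1])
            if p in nodes and q in nodes:
                edges[p][q] = edges[p].get(q, 0) + 1
    return edges  # directly usable as input to usage_weighted_pagerank above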
Our future plans involve the application of l-UPR on different NG synopses. As
shown in the experimental evaluation, l-UPR is a very promising recommendation
algorithm. In our study we applied it to the Markov Chain NG synopsis. We expect better
results in the case of more complex NG synopses, which approximate more accurately the
navigational graph.
Another issue that should be taken into consideration in the process of assigning
importance scores to the web pages of a web site is that of freshness and trends in the
web navigation context. We believe that pages/paths with more recent visits, or an
increasing rate of visits, should be favored in the recommendation process [BVW04], and we
aim to incorporate this intuition in our future work.
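
A simple way to capture this intuition, sketched here in the spirit of T-Rank [BVW04]
with a hypothetical half-life parameter, is to replace raw visit counts by exponentially
decayed contributions, so that recent visits weigh more:

import math, time

def recency_weighted_count(visit_timestamps, half_life_days=30.0):
    """Each visit contributes exp(-lam * age) instead of 1, with lam chosen
    so that a visit's weight halves every `half_life_days` days."""
    lam = math.log(2) / (half_life_days * 86400.0)
    now = time.time()
    return sum(math.exp(-lam * (now - t)) for t in visit_timestamps)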
Moreover, we plan to investigate how this hybrid usage/structure ranking can be
applied to a unified web/navigational graph that extends beyond the limits of a single
web site. Such an approach would enable a global importance ranking over the web,
enhancing both web search results and the recommendation process.
Finally, we conclude with our vision for web personalization systems. This thesis has
shown how the integration of content semantics or link analysis techniques can improve
the recommendation process. We believe that the next step would be to exploit all these
data, namely usage, content and structure, in a single, unified framework. This
framework need not be specifically focused on generating personalized
recommendations, but should cover many web applications, such as web search results
ranking or web content (blogs, bookmarked pages, multimedia, etc.) categorization.

LIST OF REFERENCES
[ADW02] C. Anderson, P. Domingos, D. S. Weld, Relational Markov Models and their
Application to Adaptive Web Navigation, in Proc. of the 8th ACM SIGKDD
Conference, Canada (2002)
[AG03] S. Acharyya, J. Ghosh, Context-Sensitive Modeling of Web Surfing Behaviour
Using Concept Trees, in Proc. of the 5th WEBKDD Workshop, Washington DC (2003)
[ANM04] M.S. Aktas, M.A. Nacar, F. Menczer, Personalizing PageRank Based on
Domain Profiles, in Proc. of the 6th WEBKDD Workshop, Seattle (2004)
[AP+04] M. Albanese, A. Picariello, C. Sansone, L. Sansone, A Web Personalization
System based on Web Usage Mining Techniques, in Proc. of WWW2004, New York
(2004)
[AS94] R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules, in Proc. of
20th VLDB Conference (1994)
[B02] B. Berendt, Using site semantics to analyze, visualize and support navigation, in
Data Mining and Knowledge Discovery Journal, 6: 37-59 (2002)
[BB+99] A.G. Buchner, M. Baumgarten, S.S. Anand, M.D. Mulvenna, J.G. Hughes,
Navigation pattern discovery from Internet data, in Proc. of the 1st WEBKDD
Workshop, San Diego (1999)
[BG+04] V. Bacarella, F. Giannotti, M. Nanni, D. Pedreschi, Discovery of Ads Web
Hosts through Traffic Data Analysis, in Proc. of the 9th ACM SIGMOD Workshop on
Research Issues in Data Mining and Knowledge Discovery (DMKD 04), Paris,
France (2004)
[BHS02] B. Berendt, A. Hotho, G. Stumme, Towards Semantic Web Mining, in Proc. of
the 1st Intl. Semantic Web Conference (ISWC 2002)
[BL99] J. Borges, M. Levene, Data Mining of User Navigation Patterns, in Web
Usage Analysis and User Profiling, published by Springer-Verlag as Lecture Notes in
Computer Science, 1836: 92-111
[BL04] J. Borges, M. Levene, A Dynamic Clustering-Based Markov Model for Web
Usage Mining, Technical Report, available at http://xxx.arxiv.org/abs/cs.IR/0406032
(2004)
[BL06] J. Borges, M. Levene, Ranking Pages by Topology and Popularity within Web
Sites, accepted for publication in World Wide Web Journal (2006)
[BP98] S. Brin, L. Page, The anatomy of a large-scale hypertextual Web search engine,
Computer Networks, 30(1-7): 107-117 (1998)
[BS00] B. Berendt, M. Spiliopoulou, Analysing navigation behaviour in web sites
integrating multiple information systems, The VLDB Journal 9(1):56-75 (2000)
[BS04] R. Baraglia, F. Silvestri, An Online Recommender System for Large Web Sites, in
Proc. of ACM/IEEE Web Intelligence Conference (WI04), China (2004)

[BVW04] K. Berberich, M. Vazirgiannis, G. Weikum, T-Rank: Time-aware Authority
Ranking, in Proc. of the 3rd Workshop on Algorithms and Models for the Web-Graph (WAW 2004),
Rome, Italy (2004)
[CD+97] S. Chakrabarti, B. Dom, R. Agrawal, P. Raghavan, Using taxonomy,
discriminants, and signatures for navigation in text databases, in Proc. of the 23rd
VLDB Conference, Athens, Greece (1997)
[CD+98] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, J. Kleinberg,
Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated
Text, in Proc. of WWW7 (1998)
[CD+99] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, R. Kumar, P. Raghavan, S.
Rajagopalan, A. Tomkins, Mining the Link Structure of the World Wide Web, IEEE
Computer 32(6), (1999)
[CH+00] I. Cadez, D. Heckerman, C. Meek, P. Smyth, S. White, Visualization of
Navigation Patterns on a Web Site Using Model Based Clustering, in Proc. of ACM
KDD2000 Conference, Boston MA (2000)
[CMS97] R. Cooley, B. Mobasher, J. Srivastava, Web Mining: Information and Pattern
Discovery on the World Wide Web, in Proc. of the 9th IEEE International Conference
on Tools with Artificial Intelligence (ICTAI '97)
[CMS99] R. Cooley, B. Mobasher, J. Srivastava, Data preparation for
mining world wide Web browsing patterns, Knowledge and Information Systems, 1(1),
(1999)
[CPY96] M.S. Chen, J.S. Park, P.S. Yu, Data Mining for Path Traversal Patterns in a
Web Environment, in Proc. of the 16th Intl. Conference on Distributed Computing
Systems (1996)
[CTI] CTI DePaul web server data, http://maya.cs.depaul.edu/~classes/ect584/data/ctidata.zip
[DBN] DB-NET web server data, http://www.db-net.aueb.gr
[DK04] M. Deshpande, G. Karypis, Selective Markov Models for Predicting Web-Page
Accesses, in ACM Transactions on Internet Technology, 4(2):163-184, (2004)
[DM02] H. Dai, B. Mobasher, Using Ontologies to Discover Domain-Level Web Usage
Profiles, in Proc. of the 2nd Workshop on Semantic Web Mining, Helsinki, Finland
(2002)
[E04] M. Eirinaki, Web Mining: A Roadmap, IST/NEMIS Technical Report, 2004,
http://www.db-net.aueb.gr
[EK+98] M. Ester, H. P. Kriegel, J. Sander, M. Wimmer, X. Xu, Incremental Clustering
for Mining in a Data Warehousing Environment, in Proc. of the 24th VLDB
Conference (1998)
[EL+04] M. Eirinaki, C. Lampos, S. Paulakis, M. Vazirgiannis, Web Personalization
Integrating Content Semantics and Navigational Patterns, in Proc. of ACM WIDM
2004, Washington D.C. (2004)

[EM+06] M. Eirinaki, D. Mavroeidis, G. Tsatsaronis, M. Vazirgiannis, Semantic Web
Personalization: The role of Ontologies, book chapter in Semantics, Web, and
Mining (working title), eds. M. Ackermann, B. Berendt, M. Grobelnik, A. Hotho, D.
Mladenic, G. Semeraro, M. Spiliopoulou, G. Stumme, V. Svatek, M. van Someren, to
be published by Springer, LNCS/LNAI
[EV03] M. Eirinaki, M. Vazirgiannis, Web Mining for Web Personalization, in ACM
Transactions on Internet Technology (ACM TOIT), 3(1):1-27 (2003)
[EV05] M. Eirinaki, M. Vazirgiannis, Usage-based web personalization, in Proceedings
of 5th IEEE International Conference on Data Mining (ICDM 2005), Houston, Texas,
(2005)
[EVA05] M. Eirinaki, J. Vlahakis, S.S. Anand, An Integrated Web Personalization
Platform Based on Content Structures and Usage Behaviour, book chapter in
"Intelligent Techniques in Web Personalization", eds. B. Mobasher, S.S. Anand, LNAI
3169, pp. 272-288, Springer Verlag, (2005)
[EVK05] M. Eirinaki, M. Vazirgiannis, D. Kapogiannis, Web path recommendations
based on Page Ranking and Markov Models, in Proc. of ACM WIDM 2005, Bremen,
Germany (2005)
[EVV03] M. Eirinaki, M. Vazirgiannis, I. Varlamis, "SEWeP: Using Site Semantics and
a Taxonomy to Enhance the Web Personalization Process", in Proc. of the 9th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining
(KDD2003), Washington DC (2003)
[EGP02] P. Edwards, G.A. Grimnes, A. Preece, An Empirical Investigation for Learning
from the Semantic Web, in Proc. of the 2nd Semantic Web Mining Workshop, Helsinki,
Finland (2002)
[GKG05] J. Guo, V. Keselj, Q. Gao, Integrating Web Content Clustering into Web Log
Association Rule Mining, In Proc. of Canadian AI 2005 (2005)
[H02] T. Haveliwala, Topic-Sensitive PageRank, in Proc. of WWW2002 Conference,
Hawaii (2002)
[HBV02] M. Halkidi, Y. Batistakis, M. Vazirgiannis, Cluster Validity Methods: Part II,
SIGMOD Record, September 2002
[HEK03] S. Holland, M. Ester, W. Kiebling, Preference Mining: A Novel Approach on
Mining User Preferences for Personalized Applications, in Proc. of the 7th PKDD
Conference (2003)
[HF04] G. Hooker, M. Finkelman, Sequential Analysis for Learning Modes of Browsing,
in Proc. of the 6th WEBKDD Workshop, Seattle (2004)
[HG+02] T.H. Haveliwala, A. Gionis, D. Klein, P. Indyk, Evaluating Strategies for
Similarity Search on the Web, in Proc. of WWW11, Hawaii (2002)
[HLC05] Z. Huang, X. Li, H. Chen, Link Prediction Approach to Collaborative Filtering,
in Proc. of ACM JCDL05, (2005)

[HN+01] Z. Huang, J. Ng, D.W. Cheung, M.K. Ng, W.-K. Ching, A
Cube Model for Web Access Sessions and Cluster Analysis, in Proc. of the 3rd
WEBKDD Workshop (2001)
[HN+03] M. Halkidi, B. Nguyen, I. Varlamis, M. Vazirgiannis, THESUS: Organizing
Web Documents into Thematic Subsets using an Ontology, VLDB Journal, 12(4): 320-332, (2003)
[JF+97] T. Joachims, D. Freitag, T. Mitchell, WebWatcher: A Tour Guide for the World
Wide Web, in Proc. of IJCAI97, (1997)
[JPT03] S. Jespersen, T.B. Pedersen, J. Thorhauge, Evaluating the Markov Assumption
for Web Usage Mining, in Proc. of ACM WIDM 2003, Louisiana, (2003)
[JZM04a] X. Jin, Y. Zhou, B. Mobasher, Web Usage Mining Based on Probabilistic
Latent Semantic Analysis, in Proc. of the 10th ACM SIGKDD Conference on
Knowledge Discovery and Data Mining (KDD'04), Seattle, (2004)
[JZM04b] X. Jin, Y. Zhou, B. Mobasher, A Unified Approach to Personalization based
on Probabilistic Latent Semantic Models of Web usage and Content, in Proceedings of
AAAI Workshop on Semantic Web Personalization (SWP04), (2004)
[JZM05] X. Jin, Y. Zhou, B. Mobasher, A Maximum Entropy Web Recommendation
System: Combining Collaborative and Content Features, in Proceedings of the ACM
SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'05), Chicago
(2005)
[Kij97] M. Kijima, Markov Processes for Stochastic Modeling, Chapman & Hall,
London, (1997)
[KS04] J. Kleinberg, M. Sandler, Using Mixture Models for Collaborative Filtering, in
Proc. of ACM Symposium on Theory of Computing (STOC04), (2004)
[KG90] M. Kendall, J.D. Gibbons, Rank Correlation Methods, Oxford University Press,
(1990)
[KHG03] S.D. Kamvar, T.H. Haveliwala, and G.H. Golub, Adaptive Methods for the
Computation of PageRank, in Proc. of the Intl. Conference on the Numerical Solution
of Markov Chains, (2003)
[KH+03] S.D. Kamvar, T.H. Haveliwala, C.D. Manning, and G.H. Golub, Extrapolation
Methods for Accelerating PageRank Computations, in Proc. of the 12th Intl. World
Wide Web Conference, (2003)
[KJ+01] R. Krishnapuram, A. Joshi, O. Nasraoui, L. Yi, Low-Complexity
Fuzzy Relational Clustering Algorithms for Web Mining, in IEEE Transactions of
Fuzzy Systems, (2001)
[LE+04] C. Lampos, M. Eirinaki, D. Jevtuchova, M. Vazirgiannis, Archiving the Greek
Web, in Proc. of the 4th Intl. Web Archiving Workshop (IWAW04), Bath, UK (2004)
[LL03] M. Levene, G. Loizou, Computing the Entropy of User Navigation in the Web, in
Intl. Journal of Information Technology and Decision Making, 2: 459-476, (2003)
[MD+00a] B. Mobasher, H. Dai, T. Luo, Y. Sung, J. Zhu, Discovery of Aggregate Usage
Profiles for Web Personalization, in Proc. of 2nd WEBKDD Workshop, Boston (2000)

[MD+00b] B. Mobasher, H. Dai, T. Luo, Y. Sung, J. Zhu, Integrating web usage and
content mining for more effective personalization, in Proc. of the Intl. Conference on
Ecommerce and Web Technologies (ECWeb), Greenwich, UK (2000)
[ML+04] R. Meo, P.L. Lanzi, M. Matera, R. Esposito, Integrating Web Conceptual
Modeling and Web Usage Mining, in Proc. of the 6th WEBKDD Workshop, Seattle
(2004)
[MPG03] E. Manavoglu, D. Pavlov, C.L. Giles, Probabilistic User Behaviour Models, in
Proc. of the 3rd Intl. Conference on Data Mining (ICDM 2003)
[MPT99] F. Masseglia, P. Poncelet, M. Teisseire, Using Data Mining Techniques on Web
Access Logs to Dynamically Improve Hypertext Structure, in ACM SigWeb Letters,
8(3):13-19, (1999)
[MR95] R. Motwani, P. Raghavan, Randomized Algorithms, Cambridge University
Press, United Kingdom (1995)
[MSN] msnbc.com Web Log Data, available from UCI KDD Archive,
http://kdd.ics.uci.edu/databases/msnbc/msnbc.html
[MSR04] S. E. Middleton, N. R. Shadbolt, D. C. De Roure, Ontological User Profiling in
Recommender Systems, ACM Transactions on Information Systems (TOIS), 22(1):54-88 (2004)
[MTV04] D. Mavroeidis, G. Tsatsaronis, M. Vazirgiannis, Semantic Distances for Sets of
Senses and Applications in Word Sense Disambiguation, in Proc. of the 3rd Intl.
Workshop on Text Mining and its Applications, Athens, Greece (2004)
[MT+05] D. Mavroeidis, G. Tsatsaronis, M. Vazirgiannis, M. Theobald, G. Weikum,
Word Sense Disambiguation for Exploiting Hierarchical Thesauri in Text
Classification, in Proc. of the ECML/PKDD 2005 Conference, Porto, Portugal (2005)
[NC+03] O. Nasraoui, C. Cardona, C. Rojas, F. Gonzales, Mining Evolving User Profiles
in Noisy Web Clickstream Data with a Scalable Immune System Clustering Algorithm,
in Proc. of the 5th WEBKDD Workshop, Washington DC (2003)
[NM02] A. Nanopoulos, Y. Manolopoulos, Efficient Similarity Search for Market Basket
Data, in the VLDB Journal, (2002)
[NM03] M. Nakagawa, B. Mobasher, A Hybrid Web Personalization Model Based on
Site Connectivity, in Proc. of the 5th WEBKDD Workshop, Washington DC (2003)
[NP03] O. Nasraoui, C. Petenes, Combining Web Usage Mining and Fuzzy Inference for
Website Personalization, in Proc. of the 5th WEBKDD Workshop, Washington DC
(2003)
[NP04] O. Nasraoui, M. Pavuluri, Complete this Puzzle: A Connectionist Approach to
Accurate Web Recommendations based on a Committee of Predictors, in Proc. of the
6th WEBKDD Workshop, Seattle, (2004)
[OB+03] D. Oberle, B. Berendt, A. Hotho, J. Gonzalez, Conceptual User Tracking, in
Proc. of the 1st Atlantic Web Intelligence Conf. (AWIC), (2003)
[P80] M. F. Porter, An algorithm for suffix stripping, Program, 14(3):130-137, (1980)

[PE00] M. Perkowitz, O. Etzioni, Towards Adaptive Web Sites: Conceptual Framework
and Case Study, in Artificial Intelligence, 118(1-2):245-275, (2000)
[PG02] N. Polyzotis, M. Garofalakis, Structure and Value Synopses for XML Data
Graphs, in Proc. of the 28th VLDB Conference (2002)
[PGI04] N. Polyzotis, M. Garofalakis, Y. Ioannidis, Approximate XML Query Answers,
in Proc. of SIGMOD 2004, Paris, France (2004)
[PL+04] S. Paulakis, C. Lampos, M. Eirinaki, M. Vazirgiannis, SEWeP: A Web Mining
System supporting Semantic Personalization, demo paper, in Proc. of the
ECML/PKDD 2004 Conference, Pisa, Italy (2004)
[PW00] T. Phelps, R. Wilensky, Robust hyperlinks: Cheap, Everywhere, Now, in Proc. of
Digital Documents and Electronic Publishing (DDEP00), Munich, Germany (2000)
[RD02] M. Richardson, P. Domingos, The Intelligent Surfer: Probabilistic Combination
of Link and Content Information in PageRank, in Neural Information Processing
Systems, 14:1441-1448, (2002)
[S00] R. R. Sarukkai, Link Prediction and Path Analysis Using Markov Chains, in
Computer Networks, 33(1-6): 337-386, (2000)
[SA95] R. Srikant, R. Agrawal, Mining Generalized Association Rules, in Proc. of 21st
VLDB Conf., Zurich, Switzerland (1995)
[SB98] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval,
Information Processing and Management, 24:513-523, (1998)
[SC+00] J. Srivastava, R. Cooley, M. Deshpande, P. Tan, Web Usage Mining: Discovery
and Applications of Usage Patterns from Web Data, SIGKDD Explorations, 1(2):12-23, (2000)
[SH03] R. Sen, M. Hansen, Predicting a Web user's next access based on log data, in
Journal of Computational Graphics and Statistics, 12(1):143-155, (2003)
[SFW99] M. Spiliopoulou, L.C. Faulstich, K. Winkler, A data miner analyzing the
navigational behaviour of Web users, in Proc. of the Workshop on Machine Learning
in User Modelling, Greece (1999)
[SK+00] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Analysis of Recommendation
Algorithms for E-Commerce, in Proc. of ACM EC00, Minnesota (2000)
[SK+01] B. Sarwar, G. Karypis, J. Konstan, J. Riedl, Item-based Collaborative Filtering
Recommendation Algorithms, in Proc. of WWW10, Hong Kong (2001)
[SZ+97] C. Shahabi, A. M. Zarkesh, J. Adibi, V. Shah, Knowledge Discovery for Users
Web-Page Navigation, in Workshop on Research Issues in Data Engineering,
Birmingham, UK (1997)
[VEA03] J. Vlahakis, M. Eirinaki, S. S. Anand, IKUM: An Integrated Web
Personalization Platform Based on Content Structures and Usage Behaviour, in Proc.
of the IJCAI-03 Workshop on Intelligent Techniques for Web Personalization
(ITWP'03), Acapulco, Mexico (2003)

[VV+04] I. Varlamis, M. Vazirgiannis, M. Halkidi, B. Nguyen, THESUS, A Closer View
on Web Content Management Enhanced with Link Semantics, in IEEE Transactions on
Knowledge and Data Engineering (TKDE), 16(6):685-700, (2004)
[Tria] M. Triantafillidis, Triantafillidis On-Line, Modern Greek Language Dictionary,
http://kastor.komvos.edu.gr/dictionaries/dictonline/DictOnLineTri.htm
[W3Clog] Extended Log File Format, http://www.w3.org/TR/WD-logfile.html
[WCA] Web Characterization Terminology & Definitions,
http://www.w3.org/1999/05/WCA-terms/
[WC+02] J. Wang, Z. Chen, L. Tao, W. Ma, L. Wenyin, Ranking Users' Relevance to a
Topic through Link Analysis on Web Logs, in Proc. of WIDM 02, (2002)
[WN] WordNet, A lexical database for the English language,
http://www.cogsci.princeton.edu/~wn/
[WP94] Z. Wu, M. Palmer, Verb Semantics and Lexical Selection, in Proc. of the 32nd
Annual Meeting of the Association for Computational Linguistics (1994)
[YH03] A. Ypma, T. Heskes, Categorization of web pages and user clustering with
mixtures of Hidden Markov Models, in Proc. of 4th WEBKDD Workshop, Canada
(2002)
[YZ+96] T.W. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal, From User Access
Patterns to Dynamic Hypertext Linking, In 5th Intl. World Wide Web Conference
(WWW5), Paris, France (1996)
[ZB04] Q. Zhao, S. S. Bhowmick, Mining History of Changes to Web Access Patterns, in
Proc. of the 8th European Conference on Principles and Practice of Knowledge
Discovery in Databases (PKDD 2004), Pisa, Italy (2004)
[ZHH02a] J. Zhu, J. Hong, J. G. Hughes, Using Markov Chains for Link Prediction in
Adaptive Web sites, in Proc. of the 1st Intl. Conference on Computing in an Imperfect
World (2002)
[ZHH02b] J. Zhu, J. Hong, J. G. Hughes, Using Markov Models for Web Site Link
Prediction, in Proc. of ACM HT02, Maryland (2002)
[ZXH98] O. R. Zaiane, M. Xin, J. Han, Discovering Web Access Patterns and Trends by
Applying OLAP and Data Mining Technology on Web Logs, in Proc. of Advances in
Digital Libraries Conference (ADL'98), Santa Barbara, CA (1998)

APPENDIX A
In order to evaluate the usefulness of the SEWeP framework, we presented the users
with three paths, each one having a different objective: one (A) containing visits to
contextually irrelevant pages (random surfer), a second (B) including a small path to very
specialized pages (information-seeking visitor), and a third one (C) including visits to
top-level, yet research-oriented pages (topic-oriented visitor). This is the second blind test
that was presented to the users, evaluating the usability of original vs. hybrid
recommendations. Note that we presented unlabeled recommendation sets to the users.

Path A
http://www.db-net.aueb.gr/people.htm
http://www.db-net.aueb.gr/links.htm
http://www.db-net.aueb.gr/courses/courses.htm ()
Recommendations

A.1 (HYBRID)
http://www.db-net.aueb.gr/pubs.php
http://www.db-net.aueb.gr/research.htm
http://www.db-net.aueb.gr/courses/postgrdb/asilomar.html
A.2 (ORIGINAL)
http://www.db-net.aueb.gr/pubs.php
http://www.db-net.aueb.gr/pubsearch.php
http://www.db-net.aueb.gr/research.htm

Path B
http://www.db-net.aueb.gr/people/michalis.htm
http://www.db-net.aueb.gr/mhalk/CV_maria.htm ()
Recommendations

B.1 (HYBRID)
http://www.db-net.aueb.gr/mhalk/Publ_maria.htm
http://www.db-net.aueb.gr/research.htm
http://www.db-net.aueb.gr/magda/papers/webmining_survey.pdf
B.2 (ORIGINAL)
http://www.db-net.aueb.gr/mhalk/Publ_maria.htm
http://www.db-net.aueb.gr/papers/gr_book/Init_frame.htm
http://www.db-net.aueb.gr/papers/gr_book/Contents.htm

Path C
http://www.db-net.aueb.gr/index.php
http://www.db-net.aueb.gr/research.htm
http://www.db-net.aueb.gr/people.htm ()
Recommendations

C.1 (ORIGINAL)
http://www.db-net.aueb.gr/projects.htm
http://www.db-net.aueb.gr/courses/courses.htm
http://www.db-net.aueb.gr/courses/courses.php?ancid=dm
C.2 (HYBRID)
http://www.db-net.aueb.gr/projects.htm
http://www.db-net.aueb.gr/courses/courses.htm
http://www.db-net.aueb.gr/courses/POSTGRDB/ballp.pdf

APPENDIX B
We present here the top-10 ranked paths generated using the Start and Total setups
(in Tables 6 and 7 respectively), as well as the 10 most frequent ones (in Table 5),
extracted from the test data set we used in our experiments for the h-PPM framework.
We observe that the rankings of the first two approaches represent the visits of students to
course material. We assume that in the period of time when the data set was collected,
many students visited the same pages and paths, accessing them directly (probably via a
bookmarked page). Therefore, their visits dominated any other path visited by any other
user. On the other hand, by taking into consideration the objective importance of a
page, as denoted by the link structure of the web site, such temporal influence is reduced.
We omit the top-10 ranked paths generated using the PR, SUPR and UPR algorithms,
since they are very similar to the Frequent paths ranking, as shown by the experimental
results (Figure 28).

Table 5. Top-10 Frequent Paths


/news/default.asp /courses/
/authenticate/login.asp?section=mycti&title=mycti&urlahead=studentprofile/studentprofile
/cti/studentprofile/studentprofile.asp?section=mycti
/news/default.asp /people/
/courses/ finish
/news/default.asp /courses/ finish
/courses/ /courses/syllabilist.asp
/cti/advising/login.asp -> /cti/advising/display.asp?page=intranetnews
/news/default.asp /courses/ /courses/syllabilist.asp
/news/default.asp /programs/
/people/ /people/search.asp

Table 6. Top-10 ranking for Start setup


/news/default.asp /courses/syllabus.asp?course=250-97-802&q=2&y=2002&id=251
/news/default.asp /courses/syllabus.asp?course=250-97-802&q=2&y=2002&id=251
/courses/syllabilist.asp
/news/default.asp /courses/syllabus.asp?course=312-99-601&q=3&y=2002&id=263
/news/default.asp /courses/syllabus.asp?course=312-99-601&q=3&y=2002&id=263 finish
/news/default.asp /courses/syllabus.asp?course=318-21-601&q=3&y=2002&id=495
/news/default.asp /courses/syllabus.asp?course=318-21-601&q=3&y=2002&id=495 /news/
/news/default.asp /courses/syllabus.asp?course=345-21-901&q=3&y=2002&id=351
/news/default.asp /courses/syllabus.asp?course=345-21-901&q=3&y=2002&id=351 finish
/news/default.asp /courses/syllabus.asp?course=364-98-601&q=3&y=2002&id=921
/news/default.asp /courses/syllabus.asp?course=364-98-601&q=3&y=2002&id=921
/courses/syllabus.asp?course=463-98-301&q=3&y=2002&id=323

Table 7. Top-10 ranking for Total setup


/courses/ /courses/syllabus.asp?course=224-21-601&q=3&y=2002&id=561
/courses/ /courses/syllabus.asp?course=224-21-601&q=3&y=2002&id=561
/courses/syllabus.asp?course=224-21-901&q=3&y=2002&id=214
/courses/ /courses/syllabus.asp?course=224-21-601&q=3&y=2002&id=561
/courses/syllabus.asp?course=224-21-901&q=3&y=2002&id=214
/courses/syllabus.asp?course=224-21-902&q=3&y=2002&id=230
/courses/ /courses/syllabus.asp?course=224-21-601&q=3&y=2002&id=561
/courses/syllabus.asp?course=224-21-901&q=3&y=2002&id=214
/courses/syllabus.asp?course=224-21-902&q=3&y=2002&id=230
/courses/syllabus.asp?course=224-21-903&q=3&y=2002&id=250 finish
/courses/ /courses/syllabus.asp?course=224-21-903&q=3&y=2002&id=250
/courses/ /courses/syllabus.asp?course=224-21-903&q=3&y=2002&id=250 finish
/courses/ /courses/syllabus.asp?course=309-21-903&q=3&y=2002&id=198
/courses/ /courses/syllabus.asp?course=309-21-903&q=3&y=2002&id=198 finish
/courses/ /courses/syllabus.asp?course=311-98-601&q=3&y=2002&id=921
/courses/ /courses/syllabus.asp?course=372-98-901&q=3&y=2002&id=326
