
QueRIE: Collaborative Database Exploration

Abstract

Relational database users employ a query interface (typically a web-based client) to issue a series of SQL queries that analyse the data and mine it for interesting information.
First-time users may lack the knowledge needed to decide where to start their exploration.
At other times, users may simply overlook queries that would retrieve important information.
In this work we describe a framework that assists non-expert users by providing personalized query recommendations.

Literature Survey

Web-based query interfaces are used by scientific databases such as:

Genome (http://genome.ucsc.edu/)
Sky Server (http://cas.sdss.org/)

Personalized recommendations for keyword or free-form query interfaces.
A multidimensional query recommendation system: addresses the problem of generating recommendations for data warehouses and OLAP systems.
Recommendation based on past queries, using the most frequently appearing tuple values.

Literature Survey Continued...

1. Hive - A Petabyte Scale Data Warehouse Using Hadoop (A. Thusoo et al.):
Hadoop is a popular open-source map-reduce implementation used by companies such as Yahoo and Facebook to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs that are hard to maintain and reuse. Hive is an open-source data warehousing solution built on top of Hadoop. Hive supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into map-reduce jobs that are executed using Hadoop.

Literature Survey Continued...


2. QueRIE: A Recommender System Supporting Interactive Database Exploration (S. Mittal, J. S. V. Varman):
This demonstration presents QueRIE, a recommender system that supports interactive database exploration. The system aims at assisting non-expert users of scientific databases by generating personalized query recommendations. Drawing inspiration from Web recommender systems, it tracks the querying behavior of each user, identifies users with similar behavior, and uses their queries as the source of recommendations.

Literature Survey Continued...


3. Amazon.com Recommendations: Item-to-Item Collaborative Filtering (G. Linden, B. Smith, and J. York):
At Amazon.com, we use recommendation algorithms to personalize the online store for each customer. The store radically changes based on customer interests, showing programming titles to a software engineer and baby toys to a new mother. There are three common approaches to solving the recommendation problem: traditional collaborative filtering, cluster models, and search-based methods.

Literature Survey Continued...


4. Personalized Queries under a Generalized Preference Model (G. Koutrika and Y. Ioannidis):
In this paper, we present a preference model that combines expressivity and concision. In addition, we provide efficient algorithms for the selection of preferences related to a query, and an algorithm for the progressive generation of personalized results, which are ranked based on user interest. Several classes of ranking functions are provided for this purpose.

Proposed Solution

The basic idea behind this project is:

Use the user query log to analyze session summaries.
Based on the session summaries, generate the target tuples.
Generate recommended queries that retrieve the target tuples.
Re-rank the recommendations based on clarity scores.

Architecture

[Architecture diagram: fragment-based recommendation pipeline with a final re-ranking stage based on K-L divergence]

Methodology / Implementation Details

Fragment-Based Recommendation

Session summaries
Recommendation seed computation
Generation of query recommendations

Query Processing

Query relaxation
Query parsing

Result Re-Ranking Based on Clarity Scores

K-L Divergence Method

Fragment-Based Recommendations

Session Summary:
The session summary vector S_i for a user i consists of all the query fragments of the user's past queries.
Let Q_i represent the set of queries posed by user i during a session, and let F represent the set of all distinct query fragments recorded in the query logs.
We assume that the vector S^Q represents a single query Q ∈ Q_i. For a given fragment φ ∈ F, we define S^Q[φ] as a binary variable that represents the presence or absence of φ in query Q.
Then S_i[φ] represents the importance of φ in session S_i.
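A minimal sketch of building a session summary, assuming queries have already been parsed into sets of fragment strings and that a fragment's importance is its occurrence count across the session (the weighting scheme is an assumption here):

```python
from collections import Counter

def session_summary(session_queries):
    """Build the session summary vector S_i from one user's queries.

    session_queries: list of queries, each already reduced to a set of
    fragment strings (relations, selection/join predicates, ...).
    Returns a dict mapping fragment -> importance weight.
    """
    summary = Counter()
    for fragments in session_queries:   # S_Q[phi] = 1 iff phi occurs in Q
        summary.update(fragments)       # importance = occurrence count (assumed)
    return dict(summary)

# Example over a hypothetical 'stars' table:
s_i = session_summary([
    {"FROM stars", "WHERE mag < ?"},
    {"FROM stars", "WHERE ra BETWEEN ? AND ?"},
])
# {'FROM stars': 2, 'WHERE mag < ?': 1, 'WHERE ra BETWEEN ? AND ?': 1}
```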

Fragment-Based Recommendations

Recommendation seed computation:

To generate recommendations, the framework computes a predicted summary S_pred that captures the predicted degree of interest of the active user; S_pred serves as the seed for the generation of recommendations.
The predicted summary is defined as follows, with a mixing factor α ∈ [0, 1] that determines the importance of the active user's queries.
Using the session summaries of the past users and a vector similarity metric, we construct the (|F| × |F|) fragment-fragment matrix that contains all similarities sim(φ, φ') for φ, φ' ∈ F.

Predicted Summary Computation
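A plausible form of the predicted summary, reconstructed from the definitions above (the exact formula is an assumption here, following the usual QueRIE formulation):

S_pred = α · S_i / |S_i| + (1 − α) · Ŝ_i / |Ŝ_i|, with Ŝ_i = M S_i,

where M is the (|F| × |F|) fragment-fragment similarity matrix. A minimal sketch in Python:

```python
import numpy as np

def predicted_summary(s_active, sim, alpha=0.5):
    """Seed computation: blend the active user's own summary with the
    interests predicted from the fragment-fragment similarities.

    s_active: the active user's session summary vector over F
    sim:      (|F| x |F|) fragment-fragment similarity matrix built from
              past users' session summaries
    alpha:    mixing factor in [0, 1]; alpha = 1 ignores past users
    """
    s_active = np.asarray(s_active, float)
    s_hat = sim @ s_active              # interest predicted from similar fragments
    return (alpha * s_active / np.linalg.norm(s_active)
            + (1 - alpha) * s_hat / np.linalg.norm(s_hat))
```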

Generation of Query Recommendations

Once the predicted summary S_pred has been computed, the top-n fragments F_n (i.e. the fragments that have received the highest weights) are selected.
Then all past queries Q ∈ ∪ Q_i receive a rank QR with respect to the top-n fragments:
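The ranking formula itself is not shown on the slide; a minimal sketch, assuming a query's rank is its normalized overlap with the top-n fragments of S_pred:

```python
import numpy as np

def recommend(s_pred, past_queries, n=10, k=3):
    """Rank all past queries against the top-n fragments of S_pred.

    past_queries: list of (query_text, fragment_vector) pairs, with each
    fragment_vector aligned to the same fragment ordering as s_pred.
    Returns the k highest-ranked queries as recommendations.
    """
    s_pred = np.asarray(s_pred, float)
    top_n = np.argsort(s_pred)[-n:]     # indices of the n heaviest fragments
    seed = np.zeros_like(s_pred)
    seed[top_n] = s_pred[top_n]         # keep only the top-n weights
    scored = []
    for text, vec in past_queries:
        norm = np.linalg.norm(vec)
        scored.append((float(seed @ vec) / norm if norm else 0.0, text))
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]
```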

Query Relaxation

Because of the plethora of slightly dissimilar queries existing in the query logs, we decided to relax them in order to increase their cardinality, and thus the probability of finding similarities between different user sessions.
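The slides do not spell out the relaxation rules; a minimal sketch, assuming relaxation means replacing literal constants in predicates with placeholders so that structurally identical queries collapse to one form:

```python
import re

def relax(query: str) -> str:
    """Relax a SQL query by replacing literal constants with placeholders.

    Queries that differ only in constants (mag < 20 vs. mag < 22) collapse
    to the same relaxed form, so overlaps between sessions become likelier.
    """
    query = re.sub(r"'[^']*'", "?", query)            # string literals -> ?
    query = re.sub(r"\b\d+(?:\.\d+)?\b", "?", query)  # numeric literals -> ?
    return re.sub(r"\s+", " ", query).strip().lower()

assert relax("SELECT * FROM stars WHERE mag < 20") == \
       relax("SELECT * FROM stars WHERE mag < 22")
```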

Query Parsing
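The parsing step extracts the fragments used by the session summaries. A minimal sketch, assuming clause-level fragments (FROM relations and WHERE predicates) pulled out with naive regular expressions; a real implementation would use a proper SQL parser:

```python
import re

def fragments(query: str) -> set:
    """Extract a crude set of clause-level fragments from a SQL query."""
    q = re.sub(r"\s+", " ", query).strip().lower()
    frags = set()
    m = re.search(r"from ([\w, ]+?)(?: where|$)", q)
    if m:
        frags.update("FROM " + t.strip() for t in m.group(1).split(","))
    m = re.search(r"where (.+)", q)
    if m:
        frags.update("WHERE " + p.strip() for p in m.group(1).split(" and "))
    return frags

print(fragments("SELECT * FROM stars, galaxies WHERE mag < 20 AND ra > 5"))
# e.g. {'FROM stars', 'FROM galaxies', 'WHERE mag < 20', 'WHERE ra > 5'}
```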

Our Contribution...
Kullback–Leibler (KL) Divergence:
KL divergence is a special case of a broader class of divergences called f-divergences. It was originally introduced by Solomon Kullback and Richard Leibler in 1951 as the directed divergence between two distributions. It can be derived from a Bregman divergence.

Our Contribution...
For discrete probability distributions P and Q, the KL divergence is defined as

D_KL(P ‖ Q) = Σ_i P(i) ln( P(i) / Q(i) ).

In words, it is the expectation of the logarithmic difference between the probabilities P and Q, where the expectation is taken using the probabilities P. The KL divergence is only defined if P and Q both sum to 1 and if Q(i) = 0 implies P(i) = 0 for all i (absolute continuity). If the quantity 0 ln 0 appears in the formula, it is interpreted as zero, because x ln x → 0 as x → 0.

For distributions P and Q of a continuous random variable, the KL divergence is defined as the integral

D_KL(P ‖ Q) = ∫ p(x) ln( p(x) / q(x) ) dx,

where p and q denote the densities of P and Q.
Our Contribution...
More generally, if P and Q are probability measures over a set X, and P is absolutely continuous with respect to Q, then the Kullback–Leibler divergence from P to Q is defined as

D_KL(P ‖ Q) = ∫_X ln( dP/dQ ) dP,

where dP/dQ is the Radon–Nikodym derivative of P with respect to Q, and provided the expression on the right-hand side exists. Equivalently, this can be written as

D_KL(P ‖ Q) = ∫_X ( dP/dQ ) ln( dP/dQ ) dQ,

Our Contribution...
which we recognize as the entropy of P relative to Q. Continuing in this case, if μ is any measure on X for which p = dP/dμ and q = dQ/dμ exist, then the KL divergence from P to Q is given as

D_KL(P ‖ Q) = ∫_X p ln( p / q ) dμ.

Our Contribution...
Agglomerative Clustering Algorithm
The algorithm forms clusters in a bottom-up manner, as follows:
1. Initially, put each article in its own cluster.
2. Among all current clusters, pick the two clusters with the smallest distance.
3. Replace these two clusters with a new cluster, formed by merging the two original ones.
4. Repeat the above two steps until there is only one remaining cluster in the pool.
Thus, the agglomerative clustering algorithm will result in a binary cluster tree with single-article clusters as its leaf nodes and a root node containing all the articles. In the clustering algorithm, we use a distance measure based on log likelihood. For articles A and B, the distance is defined as

d(A, B) = LL(A) + LL(B) − LL(A ∪ B),

the loss in total log likelihood incurred by merging A and B.

Our Contribution...
The log likelihood LL(X) of an article or cluster X is given by a unigram model:

LL(X) = Σ_w c_X(w) ln p_X(w).

Here, c_X(w) and p_X(w) are the count and probability, respectively, of word w in cluster X, and N_X is the total number of words occurring in cluster X. Notice that this definition is equivalent to the weighted information loss after merging two articles:

d(A, B) = N_A D( p_A ‖ p_{A∪B} ) + N_B D( p_B ‖ p_{A∪B} ),

where D(· ‖ ·) denotes the KL divergence and p_{A∪B} is the unigram distribution of the merged cluster A ∪ B.

To avoid expensive log likelihood recomputation after each cluster merging step, we define the distance between two clusters with multiple articles as the maximum pairwise distance of the articles from the two clusters:

d(C1, C2) = max_{A ∈ C1, B ∈ C2} d(A, B),

Our Contribution...

where C1 and C2 are two clusters, and A, B are articles from C1 and C2, respectively. Once a cluster tree is created, we must decide where to slice the tree to obtain disjoint partitions for building cluster-specific LMs. This is equivalent to choosing the total number of clusters. There is a tradeoff involved in this choice. Clusters close to the leaves can maintain more specifics of the word distributions. However, clusters close to the root of the tree yield LMs with more reliable estimates, because of the larger amount of data. We roughly optimized the number of clusters by evaluating the perplexity of the Hub4 development test set. We created sets of 1, 5, 10, 15, and 20 article clusters, by slicing the cluster tree at different points. A backoff trigram model was built for each cluster, and interpolated with a trigram model derived from all articles for smoothing, to compensate for the different amounts of training data per cluster.
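A compact sketch of the bottom-up procedure with the log-likelihood distance defined above (the merge loop follows the four steps listed earlier; representing articles as whitespace-separated word strings is an assumption for illustration):

```python
import math
from collections import Counter

def log_likelihood(counts):
    """LL(X) = sum over w of c_X(w) * ln p_X(w), under a unigram model."""
    n = sum(counts.values())
    return sum(c * math.log(c / n) for c in counts.values())

def distance(a, b):
    """d(A, B) = LL(A) + LL(B) - LL(A u B): likelihood loss of merging."""
    return log_likelihood(a) + log_likelihood(b) - log_likelihood(a + b)

def agglomerate(articles):
    """Bottom-up clustering; returns a binary tree of article indices."""
    clusters = [(idx, Counter(text.split())) for idx, text in enumerate(articles)]
    while len(clusters) > 1:
        # O(n^2) scan for the closest pair; fine for a sketch
        _, i, j = min((distance(clusters[i][1], clusters[j][1]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        (ti, ci), (tj, cj) = clusters[i], clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(((ti, tj), ci + cj))
    return clusters[0][0]

tree = agglomerate(["a b a b", "a b", "x y z", "x y"])   # ((0, 1), (2, 3))
```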

Our Contribution...

Here, P(LM_i) and P(LM_i | A) are the prior and posterior cluster probabilities, respectively. In training, A is the reference transcript for one story from the Hub4 development data. During testing, A is the 1-best hypothesis for the story, as determined using the standard LM.
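The slide does not show how these probabilities are used; a hedged sketch of the usual posterior-weighted interpolation of cluster LMs (the combination rule itself is an assumption here):

```python
def mix_cluster_lms(word, cluster_lms, posteriors):
    """P(word | A) as a posterior-weighted mixture of cluster LMs.

    cluster_lms: list of dicts, cluster_lms[i][w] = P(w | LM_i)
    posteriors:  list of P(LM_i | A), summing to 1, estimated from the
                 story's reference transcript (training) or its 1-best
                 hypothesis (testing).
    """
    return sum(p * lm.get(word, 0.0) for p, lm in zip(posteriors, cluster_lms))
```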

Our Contribution...
Re-ranking based on clarity score
Reranking algorithms can mainly be categorized into two approaches: pseudo relevance feedback and graph-based reranking.
The pseudo relevance feedback approach treats the top results as relevant samples and then collects some samples that are assumed to be irrelevant.
The graph-based reranking approach usually follows two assumptions. First, the disagreement between the initial ranking list and the refined ranking list should be small. Second, the approach constructs a graph where the vertices are images or videos and the edges reflect their pairwise similarities.
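The slides name the clarity score but do not give its formula; a minimal sketch, assuming the common definition of clarity as the KL divergence between the language model of a query's result set and the language model of the whole collection (higher clarity means a more focused, less ambiguous result):

```python
import math

def clarity_score(result_counts, collection_counts):
    """Clarity = D_KL(result-set LM || collection LM).

    result_counts / collection_counts: dicts mapping word -> count. A result
    set that looks like the collection as a whole scores near 0 (unclear);
    a focused result set scores high.
    """
    n_r = sum(result_counts.values())
    n_c = sum(collection_counts.values())
    score = 0.0
    for w, c in result_counts.items():
        q = collection_counts.get(w, 0) / n_c
        if q > 0:                       # skip words unseen in the collection
            p = c / n_r
            score += p * math.log(p / q)
    return score

def rerank(recommendations, collection_counts):
    """Order (query_text, result_counts) pairs by descending clarity."""
    return sorted(recommendations, reverse=True,
                  key=lambda rec: clarity_score(rec[1], collection_counts))
```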

Thank You.
