
CS615: Group 11

Probabilistic Skylines with Incomplete Data


Guide: Prof. Arnab Bhattacharya
Utsav Sinha (12775), uutsav@iitk.ac.in, Dept. of CSE
Saptarshi Gan (12624), sapgan@iitk.ac.in, Dept. of CSE
Indian Institute of Technology, Kanpur
Final Report
25th November, 2015

Abstract

Skyline queries have received substantial focus in the database research field in the recent past. A skyline point is a data entry which is not dominated by any other point in the dataset [BKS01]. Recent developments have explored skylines with incomplete data, skylines with user preferences and skylines with extensions on uncertain databases. This project investigates skyline queries on uncertain datasets with incomplete data - a model which has not been looked into before.

1 Introduction

Most real-world data has incomplete fields (dimensions) in a data point - for example, in a movie database, each movie will not have ratings from all reviewers. To tackle this sparsity of data, comparison may take place only in dimensions in which both entries have finite values. Applying traditional skyline query algorithms for complete data is not possible here due to non-transitivity and the possibility of a cyclic dominance relation [KML08]; see table 1.

Moreover, all data entries in a dataset may not exist with absolute certainty; there can be a probability associated with the existence of the data. For example, a flight scheduled to depart at 8am may get canceled (existential uncertainty) or it might depart late (locational uncertainty). So skyline computation must also take the respective probabilities into consideration.

In this project, we propose to combine the incomplete data problem with probabilistic settings - finding skylines in datasets which have missing dimensions and in which each tuple has an existence probability. Mathematically, let there be n data tuples in an m dimensional space. Let the tuples be O1, O2, ..., On and the dimensions be D1, D2, ..., Dm. Each tuple Oi has an existential probability ei, the probability with which this tuple exists in the dataset. Each tuple has Mj missing data dimensions, where 0 ≤ Mj ≤ m.

Our task is to find the skyline points in this dataset.

1.1 Motivation

To emphasize why solving this task is important, let us take an example. You are planning a tour for tomorrow and have several possible destinations in mind. There are many buses to each destination and each may or may not leave, depending on the weather conditions tomorrow. Moreover, at each destination there are certain tourist spots that you want to visit, and you have ranked each of them. Each of those tourist spots has some amenities like restaurants, hotels and markets, but not all kinds of amenities are present at each spot. There is a probability associated with each facility - the shops may be closed, the hotels may be booked or a restaurant may be closed. You would like to go to places which have less costly hotels and good restaurants and are away from crowded markets. This query is none other than the skyline query which gives
feasible options to the user. In this complete scenario, there is incomplete data in the form of missing amenities and all data is present in a probabilistic setting. So our approach to finding skyline points in this domain would offer optimal solutions.

        O1     O2     O3     O4
D1      1      5      NULL   3
D2      NULL   4      6      1
D3      8      NULL   3      2
e       0.4    0.3    0.6    0.8
PSKY    0.4    0.12   0.14   1

Table 1: Example of Uncertain Incomplete Data

1.2 Problem Formulation

As outlined in section 1, there are n tuples O1, O2, ..., On and m dimensions D1, D2, ..., Dm. The j-th dimension of tuple Oi is denoted by Oij. Each tuple Oi has an existential probability eOi and has Mk missing dimensions, where 0 ≤ Mk ≤ m. Fill NULL values on all these Mk dimensions for each object.

Definition 1 (Dominance Relation). An object Oi is said to dominate an object Oj, denoted by Oi ≻ Oj, if Oik ≤ Ojk for all k, 0 ≤ k ≤ m, such that both Oik ≠ NULL and Ojk ≠ NULL, and there exists a k′ where Oik′ < Ojk′.

According to this dominance relation, two objects are only compared on their common dimensions, the dimensions in which both of them have non-NULL values.

Definition 2 (Skyline Probability). Each object Oi has a probability PSKY_Oi associated with it which gives the chance of this object being a skyline:
PSKY_Oi = ∏_{j : Oj ≻ Oi} (1 − eOj)

Definition 3 (Skyline Set). The skyline set is defined as the set of objects whose skyline probability is above a threshold τ [PJLY07]:
SKY = {Oi | PSKY_Oi > τ}

Definition 4 (Shadowed Object). Each object Oi whose probability PSKY_Oi < τ is a shadowed object.

An example to illustrate the problem formulation is in table 1. Objects O1 and O2 have only D1 as a common non-NULL dimension, on which O11 < O21. Hence, O1 ≻ O2. Objects O2 and O3 have just D2 in common, which does not have missing data. Here, O22 < O32 and so O2 dominates O3. Similarly, it can be shown that O4 ≻ O3, O4 ≻ O2 and O3 ≻ O1. This exhibits the cycle of dominance which can happen in case of incomplete data: O1 ≻ O2, O2 ≻ O3 and O3 ≻ O1.

PSKY_O1 = (1 − e3) = 0.4
PSKY_O2 = (1 − e1)(1 − e4) = 0.6 × 0.2 = 0.12
PSKY_O3 = (1 − e2)(1 − e4) = 0.7 × 0.2 = 0.14
PSKY_O4 = 1

So with τ = 0.30, O1 and O4 are skylines.

For experimentation, since a readily available dataset meeting our needs did not exist, we synthetically generated points for independent, correlated and anti-correlated data using http://pgfoundry.org/projects/randdataset. Then we made certain dimensions NULL (missing) randomly and ran our experiments for different densities of incomplete data. We have tried two different methods of generating existential probabilities of the data - a uniform random distribution and a normal distribution with mean 0.5 and standard deviation 0.2.

2 Related Work

[KML08] explores incomplete data and the Iskyline algorithm, which reduces the number of exhaustive comparisons of the bucket algorithm using virtual points and shadow skylines. [BK13] devised the Sort-based Incomplete Data Skyline (SIDS) algorithm, which improves upon the efficiency of the Iskyline algorithm. Approaches based on completing the incomplete dataset using interpolation have also been explored in [ZLOT10]. But since filling in missing values is problematic, particularly if there is a high amount of data sparsity or the tolerance to false positives is low, we do not pursue this line any further.

[PJLY07] discusses the p-skyline, which contains all tuples whose skyline probabilities are at least p. [AQ09] takes a different approach of grid based space partitioning, bypassing the thresholding, that is, fixing a lower bound on acceptable skyline probability.

3 Approach

To the best of our knowledge, no previous work along probabilistic skylines on incomplete data has been
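The dominance test of Definition 1 and the skyline probability of Definition 2 can be sketched in a few lines of Python on the Table 1 data. This is an illustrative sketch of the definitions (not the project's implementation), with NULL entries represented as None:

```python
def dominates(a, b):
    """Definition 1: `a` dominates `b` (smaller is better) if, on their
    common non-NULL dimensions, a <= b everywhere and a < b somewhere."""
    common = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return bool(common) and all(x <= y for x, y in common) and any(x < y for x, y in common)

def skyline_probabilities(objects, existence):
    """Definition 2: PSKY_Oi = product of (1 - e_Oj) over all dominators Oj of Oi."""
    psky = []
    for i, oi in enumerate(objects):
        p = 1.0
        for j, oj in enumerate(objects):
            if i != j and dominates(oj, oi):
                p *= 1.0 - existence[j]
        psky.append(p)
    return psky

# Table 1: rows are objects O1..O4 over dimensions D1..D3; NULL is None.
objects = [(1, None, 8), (5, 4, None), (None, 6, 3), (3, 1, 2)]
existence = [0.4, 0.3, 0.6, 0.8]
psky = skyline_probabilities(objects, existence)
tau = 0.30
skyline = [i + 1 for i, p in enumerate(psky) if p > tau]  # 1-based object ids
```

Running this reproduces the PSKY row of table 1 and yields O1 and O4 as the skylines for τ = 0.30.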
pursued. We have continued along the lines where incomplete data is tackled with comparisons only in dimensions where both the entries have finite values, as defined in dominance relation 1.

As a baseline, a naive algorithm is implemented which compares every object with every other object and computes the skyline probability of each object. Those objects Oi whose PSKY_Oi < τ are pruned while the remaining objects are output as the skyline set SKY.

Two different algorithms are studied while a third algorithm is outlined as future work.

3.1 Exclusive Filtering Skylines (EFS)

The EFS algorithm optimizes over the naive algorithm in the same way as SFS optimizes over the BNL algorithm. In order to decrease the number of comparisons, we must avoid comparing shadowed objects with each other. The EFS algorithm is devised so that whenever the skyline probability of an object falls below the threshold τ, the object is removed from the dataset D.

However, such an object cannot be pruned entirely, as it can still prune some other objects whose PSKY is yet above the threshold. So these objects are inserted into ShadowedSet. After each object has been processed at least once, it is inserted into ComparedSet. The non-shadowed objects are compared exhaustively with all shadowed objects from ShadowedSet with which they were not compared before. The skyline probability is updated accordingly and the objects which remain above the threshold after all possible comparisons are inserted into the SKY set. The algorithm is described in algorithm 1.

Algorithm 1 Exclusive Filtering
Input: Dataset D
Output: skyline set SKY
 1: SKY = ∅, ShadowedSet = ∅, ComparedSet = ∅
 2: for each object Oi ∈ D do
 3:   for each object Oj ≠ Oi ∈ D do
 4:     if Oi, Oj not compared before then
 5:       if Oi ≻ Oj then
 6:         Update PSKY_Oj
 7:       else if Oj ≻ Oi then
 8:         Update PSKY_Oi
 9:       end if
10:     end if
11:     if PSKY_Oj < τ then
12:       Remove Oj from D and insert in ShadowedSet
13:     end if
14:     Remove Oi from D and put in ComparedSet
15:   end for
16: end for
17: for each object Oi ∈ ComparedSet do
18:   if PSKY_Oi > τ then
19:     if Oi has been compared n times then
20:       Insert Oi in SKY
21:     else
22:       Compare Oi with ShadowedSet and update PSKY_Oi
23:       if PSKY_Oi > τ then
24:         Insert Oi in SKY
25:       end if
26:     end if
27:   end if
28: end for
29: return SKY

In order to check whether an object has been compared with all objects in the shadowed set, we would need to maintain an N² bit matrix, which is not space efficient. So we instead keep a timestamp with each object; in the first round of checking, the timestamp is updated only for the object which is being checked against the rest. In the second round of checking, the tuple is checked only if the shadow's timestamp is less than that of the object.

3.2 Bucketed-SFS algorithm

The complete algorithm is described in algorithm 2. Partition all objects into separate buckets where each bucket contains objects which have the same non-NULL dimensions [KML08]. We apply filtering on each bucket to find shadowed objects. These objects by definition are not in the skyline set. So all non-shadowed objects from each bucket are possible candidates for being a global skyline.

Now compute the entropy of each object based on all non-NULL dimensions of this bucket:
E(Oi) = Σ_{k=1, Oik ≠ NULL}^{m} ln(1 + Oik)

Sort each bucket according to this entropy function. From the Sort Filtering Skylines (SFS) algorithm [CGGL03], we know
that if an object Oi ≻ Oj, then E(Oi) < E(Oj); and if E(Oi) < E(Oj), then Oj ⊁ Oi.

So we start from the object Oi with the lowest entropy in a bucket and iterate downwards. An inner loop is run to compare Oi with all objects having smaller entropy. Two options are used based on the number of buckets filled.

When the degree of incompleteness is between 20% and 70%, most of the buckets are somewhat uniformly occupied. So algorithm 3 is applied, which runs a loop backwards where each object Oi finds the nearest object Oj such that Oj ≻ Oi and no other Ok dominating Oi exists where j < k < i. Since all objects in a bucket are compared on the common dimensions, transitivity of the dominance relation holds within the bucket. Therefore we store a dominator pointer for each object which stores this nearest object which dominates it, and use it to avoid unnecessary comparisons.

But when most of the buckets are empty, we either have a huge amount of missing data or very little of it, since the existing data is accumulated in few buckets. Here, the pruneSelf algorithm with the usual top-down iteration as in SFS is to be used, since the pointer method is also going to result in O(sizeof(bucket)²) comparisons.

Algorithm 2 Bucketed SFS
Input: Database D
Output: skyline set SKY
 1: Initialize PSKY_O with 1 for all objects
 2: Partition data points in D into 2^m buckets {B1, B2, ..., B2^m} based on NULL dimensions
 3: for each bucket Bi do
 4:   pruneSelf(Bi)
 5: end for
 6: for each bucket Bi do
 7:   for each bucket Bj ≠ Bi do
 8:     mergeBuckets(Bi, Bj)
 9:   end for
10: end for
11: for each bucket Bi do
12:   for each non shadowed object Ok ∈ Bi do
13:     for each bucket Bj ≠ Bi do
14:       for each shadowed object Ol ∈ Bj not compared with Ok before do
15:         if Ol ≻ Ok then
16:           PSKY_Ok = PSKY_Ok · (1 − e_Ol)
17:         end if
18:       end for
19:     end for
20:   end for
21:   if PSKY_Ok > τ then
22:     SKY.insert(Ok)
23:   end if
24: end for
25: return SKY

Algorithm 3 PruneSelf
Input: Bucket Bi
 1: for each object Ok ∈ Bi do
 2:   Calculate entropy in non-NULL dimensions
 3: end for
 4: Sort bucket based on entropy in non-decreasing order
 5: Initialize parent pointer array to track dominators
 6: for k = 0 to Bi.size() − 1 do
 7:   Initialize bitmap to false
 8:   for l = k − 1 down to 0 do
 9:     if parent pointer marked as dominator in bitmap then
10:       avoid comparison and update PSKY_Ok
11:     else if Ol ≻ Ok then
12:       PSKY_Ok = PSKY_Ok · (1 − e_Ol)
13:       Update bitmap to mark any dominator of Ol as Ok's dominator
14:     end if
15:   end for
16:   Update parent pointer as the nearest object with highest entropy which dominates Ok
17: end for
18: All objects in bucket Bi whose PSKY < τ are marked as shadowed

After this analysis, all objects whose skyline probability fell below the threshold are marked as shadowed. Since they can no longer become a skyline object, there is no need to further refine their computation of PSKY.

Now buckets are taken pairwise and their non-shadowed objects are compared. To make the execution faster, they are again sorted on their entropy value, but this time the entropy is calculated based on the common non-NULL dimensions of the two buckets. The pseudo code is in algorithm 4. The only difference from pruneSelf is that we have to make sure that PSKY is not updated when an object is dominated by another object of the same bucket while merging. This is because objects of the same bucket were already compared and their PSKY bounded in pruneSelf.
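The building blocks of Bucketed-SFS described above — partitioning by the non-NULL dimension signature, sorting a bucket by the SFS entropy, and scanning each object only against its lower-entropy predecessors — can be sketched as follows. This is a simplified, hypothetical sketch: it drops the dominator-pointer and bitmap optimization of algorithm 3 and performs the plain backward scan.

```python
import math
from collections import defaultdict

def entropy(obj):
    # SFS-style entropy over the non-NULL dimensions: E(O) = sum of ln(1 + O_k)
    return sum(math.log(1 + v) for v in obj if v is not None)

def partition(objects):
    # Bucket signature = the positions of the non-NULL dimensions
    buckets = defaultdict(list)
    for idx, obj in enumerate(objects):
        sig = tuple(k for k, v in enumerate(obj) if v is not None)
        buckets[sig].append(idx)
    return buckets

def dominates(a, b):
    # Dominance restricted to the dimensions present in both objects
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return bool(pairs) and all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)

def prune_self(bucket, objects, existence, psky):
    # A dominator always has strictly smaller entropy, so scanning only the
    # earlier (lower-entropy) objects finds every dominator of objects[i].
    bucket.sort(key=lambda idx: entropy(objects[idx]))
    for pos, i in enumerate(bucket):
        for j in bucket[:pos]:
            if dominates(objects[j], objects[i]):
                psky[i] *= 1.0 - existence[j]

objects = [(1, 2, 3), (2, 3, 4), (5, 1, 2), (None, 7, 8)]
existence = [0.5, 0.5, 0.5, 0.5]
psky = [1.0] * len(objects)
buckets = partition(objects)
for bucket in buckets.values():
    prune_self(bucket, objects, existence, psky)
```

Since dominance within a bucket is transitive and a dominator always has strictly smaller entropy, the backward scan is guaranteed to meet every dominator of an object before reaching it.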
Algorithm 4 MergeBuckets
Input: Buckets Bi, Bj
 1: for each non shadowed object Ok ∈ Bi do
 2:   Calculate entropy based on common non-NULL dimensions in Bi, Bj
 3: end for
 4: for each non shadowed object Ok ∈ Bj do
 5:   Calculate entropy based on common non-NULL dimensions in Bi, Bj
 6: end for
 7: Sort Bi and Bj based on entropy in non-decreasing order
 8: Emulate the merge process of merge-sort using the sorted lists Bi and Bj
 9: for k = 0 to Bi.size() − 1 do
10:   l = 0   (denote the Bi element by Ok and the Bj element by Ol)
11:   while Ol.entropy < Ok.entropy do
12:     if Ol ≻ Ok then
13:       PSKY_Ok = PSKY_Ok · (1 − e_Ol)
14:     end if
15:     l = l + 1
16:   end while
17: end for
18: for l = 0 to Bj.size() − 1 do
19:   k = 0   (denote the Bi element by Ok and the Bj element by Ol)
20:   while Ok.entropy < Ol.entropy do
21:     if Ok ≻ Ol then
22:       PSKY_Ol = PSKY_Ol · (1 − e_Ok)
23:     end if
24:     k = k + 1
25:   end while
26: end for
27: All objects in both buckets whose PSKY < τ are marked as shadowed

The space complexity of each call to pruneSelf is O(sizeof(bucket)), while the combined space requirement of all 2^m calls to pruneSelf is O(n), since all objects are considered exactly once in pruneSelf. The space complexity of MergeBuckets is O(sizeof(bucket1)) + O(sizeof(bucket2)) and each bucket pair is called once. So the total space complexity is again O(n) and the buckets are reused across calls.

3.3 Dimension based sorting

Taking ideas from the k-skylines algorithm Sort-based Incomplete Data Skyline (SIDS) as explained in [BK13], our algorithm 2 can be optimized if dimension based sorting and early stopping are done. So instead of sorting by entropy, we will sort by dimensions and store the result for future lookups. All dimensions will be considered in a round robin fashion to prune other objects and update their PSKY. This algorithm forms a part of our future investigation.

Figure 1: Ratio of Comparisons vs Dataset Size for 20% Incomplete 10 dimensional Correlated Data

Figure 2: Ratio of Comparisons vs Dataset Size for 20% Incomplete 10 dimensional Independent Data

4 Results

The number of comparisons vs the size of the dataset is in figures 1, 2 and 3 for 20% of data missing. The corresponding figures for 60% missing data are 4, 5 and 6. The threshold probability was kept at 0.001 for all the experiments and the reported figures use the normal distribution for existential probability.

The naive algorithm has a fixed number of comparisons equal to n(n − 1)/2. In case of the correlated dataset, the EFS algorithm works well while it performs poorly for anti-correlated data. Moreover, as the degree of missing
data is increased, we see that the number of comparisons of EFS falls until half the dimensions are full. Increasing missing data further increases the number of comparisons.

Figure 3: Ratio of Comparisons vs Dataset Size for 20% Incomplete 10 dimensional Anti-Correlated Data

Figure 4: Ratio of Comparisons vs Dataset Size for 40% Incomplete 10 dimensional Correlated Data

Figure 5: Ratio of Comparisons vs Dataset Size for 40% Incomplete 10 dimensional Independent Data

Figure 6: Ratio of Comparisons vs Dataset Size for 40% Incomplete 10 dimensional Anti-Correlated Data

On the other hand, Bucketed-SFS performs almost at par with EFS for a small proportion of missing data, but its performance improves up to 100 times that of EFS when more incomplete data is added. This happens until missing data occupies around 70% of all data, beyond which the number of comparisons starts rising steadily. The figures for number of comparisons vs percentage of missing data are in figures 7, 8 and 9 for 10 dimensional data with 10000 objects.

The reason for such a sudden decrease in the number of comparisons is the even spreading out of the data into all possible 2^m buckets when more data is missing. When less data is missing, most of the data accumulates in buckets whose indices are at the end - the indices corresponding to most dimensions having non-NULL data. This uniform spreading of data makes both the pruneSelf and MergeBuckets procedures get called on many smaller sized buckets instead of a few larger sized buckets. The time for separately sorting many smaller lists is also less than that for their combined list.

Figure 7: Ratio of Comparisons vs Degree of Incompleteness for 10k 10 dimensional Correlated Data

Figure 8: Ratio of Comparisons vs Degree of Incompleteness for 10k 10 dimensional Independent Data

Figure 9: Ratio of Comparisons vs Degree of Incompleteness for 10k 10 dimensional Anti-Correlated Data

But when the proportion of missing data crosses 70%, most objects start having many dimensions missing, thereby making them accumulate in buckets whose indices are at the start. So again, the performance suffers due to fewer large buckets. But since most of the buckets themselves become incomparable, the performance is not hindered as much as in the case when there was less missing data. In a nutshell, this entire process is similar to the partition size in quicksort - with equally sized
partitions, the performance is optimal.

Figure 10: Ratio of Comparisons vs Number of Dimensions for Correlated Data

Figure 11: Ratio of Comparisons vs Number of Dimensions for Independent Data

Figure 12: Ratio of Comparisons vs Number of Dimensions for Anti-Correlated Data

Figure 13: Ratio of Comparisons vs Threshold Probability for 10k 10-dimensional 60% Incomplete Correlated Data

Moreover, the number of skylines also has a huge impact on the number of comparisons. In cases where the number of skylines is 80% of all data, comparison of these skylines with each other is bound to take place, making the ratio of the number of comparisons high. But when the number of skylines is small, both EFS and Bucketed-SFS save a lot of comparisons.

The effect of changing the number of dimensions on the EFS and Bucketed-SFS algorithms for different types of datasets is in figures 10, 11 and 12. With an increase in the dimensions of the dataset, the performance decreases since more computation is required before declaring any object as skyline/non-skyline. Again, the effect of the number of skyline objects actually found is also evident from the figures.

For a higher number of dimensions, the performance of Bucketed-SFS degrades more than that of EFS. This is due to the creation of 2^m buckets, where m is the number of dimensions. Entropy values are calculated after every two buckets are merged, and all C(2^m, 2) pairs of buckets are considered. Also, sorting is performed after every merge operation, which leads to considerable overhead when the number of data points is small.

In order to see the effect of the threshold probability on the number of comparisons, we varied it from 10^-4 to 1.0; the number of skylines along with the ratio of comparisons with the naive algorithm is in figures 13, 14 and 15. 10000 points with 10 dimensions and 60% missing data have been used here.

Moreover, to save time when the number of dimensions is
high, we apply an effective early comparison termination technique - whenever two objects being compared dominate each other in at least one dimension each, we know that they are incomparable, since neither of the two can dominate the other. So there is no further use in comparing them on the remaining dimensions of the query. The number of dimensions compared for two objects was also recorded besides the number of object-to-object comparisons, and the corresponding statistics for 10000 sized 10 dimensional data are in table 2.

Data              % missing data   #Dimensions compared   #Objects compared   #Dimensions / #Objects ratio
Correlated        0                185509661              47393466            3.91
Independent       0                149266285              49990682            2.99
Anti-correlated   0                131461868              49995000            2.63
Correlated        20               38849105               8721242             4.45
Independent       20               83140890               27902713            2.98
Anti-correlated   20               102804469              38038064            2.70

Table 2: Dimension comparison and Object comparison for Bucketed-SFS in 10000 sized 10 dimensional data

Ideally, this ratio should be equal to the number of dimensions being compared, which is 10 in our experiment. But as can be seen, it is less than five for correlated, independent and anti-correlated data. This means that on an average, we are saving comparisons on five dimensions for any two objects. Moreover, for anti-correlated data, it is around 2.5, which means that on an average, just by comparing 2 to 3 dimensions of the objects, we can decide whether one dominates the other. Such a low value is possible since the data is anti-correlated - higher in one dimension implies lower in another dimension, leading to early declaration of objects being incomparable.

Figure 14: Ratio of Comparisons vs Threshold Probability for 10k 10-dimensional 60% Incomplete Independent Data

Figure 15: Ratio of Comparisons vs Threshold Probability for 10k 10-dimensional 60% Incomplete Anti-Correlated Data

Another interesting insight is that Bucketed-SFS performs far better not only in case of correlated data, where shadowed objects are populated rapidly, but also in case of anti-correlated and independent data. This is because Bucketed-SFS partitions data into buckets based on missing data dimensions and is more dependent on the uniformity of the bucket distribution than on the nature of the data. The nature of the data decides the rate at which shadowed objects will be formed: the more correlated the data, the more chance there will be for any object to be dominated by another, thereby pruning (shadowing) it earlier. So unnecessary comparisons of shadowed objects are avoided earlier.
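The early-termination check described above can be sketched as follows (a hypothetical helper, not the project's code): dimensions are scanned one at a time, and the scan stops as soon as each object has won at least one common dimension, because the pair is then incomparable.

```python
def compare_early_stop(a, b):
    """Compare two (possibly incomplete) objects dimension by dimension,
    stopping as soon as they are known to be incomparable.
    Returns ('a', 'b' or 'none', number of dimensions actually examined)."""
    a_wins = b_wins = examined = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue                       # dimension missing in one object: skip
        examined += 1
        if x < y:
            a_wins += 1
        elif y < x:
            b_wins += 1
        if a_wins and b_wins:              # each has won a dimension: incomparable
            return "none", examined
    if a_wins and not b_wins:
        return "a", examined
    if b_wins and not a_wins:
        return "b", examined
    return "none", examined
```

On anti-correlated pairs, the loop tends to stop after the first few dimensions, which matches the low dimensions-per-comparison ratio reported in table 2.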
5 Conclusions and Future Work

In this project, we have explored two new algorithms to tackle skyline queries in uncertain datasets with incomplete data.

- Bucketed-SFS outperforms EFS when there is a considerable amount of missing data (> 20% of the total data). In other cases, EFS performs slightly better.

- Bucketed-SFS performs 20 times faster than the naive algorithm on average for all kinds of data.

- The EFS algorithm is best suited for correlated data, whereas Bucketed-SFS is appropriate for all kinds of data - it is virtually independent of the nature of the data when compared with the other two algorithms.

- Avoiding the entropy calculation and sorting overhead when the size of the merged lists falls below a certain size can be done as an optimization in Bucketed-SFS.

- The SIDS heuristic can be used in the Exclusive Filtering algorithm for early termination of the query.

- Extend the proposed algorithms to go from the existential probabilistic model to the locational uncertainty model.

- The bottom up algorithm of p-skylines can be incorporated into the Iskyline algorithm.

- Use an R-Tree indexed structure to store the objects for logarithmic time queries. The current methods are also entirely memory based and need to be ported to a file based system.

References

[AQ09] Mikhail J. Atallah and Yinian Qi. Computing all skyline probabilities for uncertain data. In Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 279-287. ACM, 2009.

[BK13] Rahul Bharuka and P. Sreenivasa Kumar. Finding skylines for incomplete data. In Proceedings of the Twenty-Fourth Australasian Database Conference - Volume 137, pages 109-117. Australian Computer Society, Inc., 2013.

[BKS01] S. Borzsony, Donald Kossmann, and Konrad Stocker. The skyline operator. In Proceedings of the 17th International Conference on Data Engineering, pages 421-430. IEEE, 2001.

[CGGL03] Jan Chomicki, Parke Godfrey, Jarek Gryz, and Dongming Liang. Skyline with presorting. In ICDE, volume 3, pages 717-719, 2003.

[KML08] Mohamed E. Khalefa, Mohamed F. Mokbel, and Justin J. Levandoski. Skyline query processing for incomplete data. In Proceedings of the IEEE 24th International Conference on Data Engineering (ICDE 2008), pages 556-565. IEEE, 2008.

[PJLY07] Jian Pei, Bin Jiang, Xuemin Lin, and Yidong Yuan. Probabilistic skylines on uncertain data. In Proceedings of the 33rd International Conference on Very Large Data Bases, pages 15-26. VLDB Endowment, 2007.

[ZLOT10] Zhenjie Zhang, Hua Lu, Beng Chin Ooi, and Anthony K. Tung. Understanding the meaning of a shifted sky: a general framework on extending skyline query. The VLDB Journal, 19(2):181-201, 2010.
