
Range Selectivity Estimation for Continuous Attributes

Flip Korn
AT&T Labs - Research
Florham Park, NJ 07932
flip@research.att.com

Theodore Johnson
AT&T Labs - Research
Florham Park, NJ 07932
johnsont@research.att.com

H.V. Jagadish*
UIUC
Urbana, IL 61801
jag@cs.uiuc.edu

* Work performed while with AT&T Labs.

Abstract
Many commercial database systems maintain histograms to efficiently estimate query selectivities as part of query optimization. Most work on histogram design is implicitly geared towards discrete or categorical attribute value domains. In this paper, we consider approaches that are better suited for the continuous valued attributes commonly found in scientific and statistical databases. We propose two methods based on spline functions for estimating the selectivity of range queries over univariate and multivariate data. These methods are more accurate than histograms: as the results from our experiments on both real and synthetic data sets demonstrate, the proposed methods achieve substantially lower estimation error (up to 5.5 times lower) than the state-of-the-art histograms, at exactly the same storage space and with comparable CPU runtime overhead. Moreover, the superiority of the proposed spline methods is amplified when applied to multivariate data.

1 Introduction

Selectivity estimation is an important part of query optimization. Estimates can be used to select the best from among many competing access plans. There are two general classes of methods for selectivity estimation: sampling methods and statistical methods. We consider nonparametric statistical methods in this paper. For an instance of work on parametric methods, see [2]; for an instance of work on sampling methods, see [5]. Nonparametric methods determine the shape of the distribution from the available data, without necessarily conforming to a formal process model. In this sense, the data are allowed to speak for themselves. Of the nonparametric methods we consider two approaches: histograms [8, 7, 9, 16, 14, 15] and curve-fitting [18, 1]. We briefly review some of the histogram methods (e.g., equiwidth, equidepth, end-biased, maxdiff) in Sec. 2.1; an excellent taxonomy of histograms can be found in [16]. Curve-fitting approaches alternative to the proposed approach are considered in Sec. 6.

Many data sets, such as scientific and statistical data sets, have continuous valued attributes. The state-of-the-art histograms [9, 16, 15] are implicitly geared towards discrete or categorical attribute value domains where there are relatively few distinct values in the attribute. As such, these methods can be, and are, also used for estimating join selectivities [9]. In the absence of many duplicate values, as is the case in many scientific and statistical data sets, an equijoin will effectively result in the empty set, rendering these methods ineffective.

Let us examine some of the formal definitions upon which the recent work on histogram design is based. In [16], the authors define the data distribution of an attribute X in a relation R as follows:

Definition 1 The data distribution of X (in R) is the set of tuples

    \{(v_1, f_1), (v_2, f_2), \ldots, (v_D, f_D)\}    (1)

where v_i are the unique values present in X, and f_i are the frequencies of (number of tuples with) value v_i, for i = 1, \ldots, D; D \leq |X|.

For the continuous data sets we have in mind, where there are relatively few duplicates, f_i = 1 for most i. For these data sets, it makes more sense to talk about the density of a value:

Definition 2 The function f is called the probability density function (p.d.f.) of a continuous random variable X, defined for all real x \in (-\infty, \infty), if

    P\{X \in B\} = \int_B f(x)\,dx    (2)

for any set B of real numbers.

Of course, one can always quantize continuous data, and in some sense, finite precision storage in digital computers already does this for us. However, if this quantization is done with fine enough granularity, most of the discrete cells will still have no data items in them, and a few will have one. On the other hand, a coarse quantization will result in a reasonable discrete data set while introducing significant quantization error, which may be unacceptable.

In this paper, we focus on the task of range selectivity estimation over univariate and multivariate data. We extend the best discrete methods and propose two new methods based on splines for continuous domains. These methods are implicitly built upon the continuous model of Eq. 2. The first method, called KernelSplines, involves estimating the data density (p.d.f.) via smooth kernels and then storing a compact approximation of the density as cubic splines. The second method, called OptimalSplines, involves estimating the density via maximum likelihood estimation. The proposed methods are more accurate than histograms. As the results from our experiments on both real and synthetic data sets show, the proposed methods achieve substantially lower (up to 5.5 times) estimation error than the state-of-the-art histograms (equiwidth, end-biased, maxdiff), at exactly the same storage space and with comparable CPU runtime overhead; moreover, the superiority of the proposed spline methods becomes even more dramatic for multivariate data.

The bulk of the database literature on histograms is focused towards query optimizers, as is our own work. However, approximate query answering is becoming increasingly important to provide data analysts with interactive responses from large data warehouses [6, 11]. To the extent that many data values (e.g., money amounts) stored in a data warehouse are drawn from a naturally continuous domain, the techniques we present in this paper are applicable to data warehousing contexts as well as to query optimizers.

The paper is organized as follows: Section 2 gives some intuition behind the statistical methods in this paper. Section 3 introduces the proposed univariate methods. Section 4 gives the experimental results. Section 5 presents the proposed multivariate methods and some results. Section 6 mentions some related work. Section 7 lists the conclusions and directions for future research.

Symbol             Definition
N                  number of data points
β                  number of univariate bins or knot points
n, m               numbers of bivariate bins in each dimension
[a, b]             1-d interval range
[a, b] × [c, d]    2-d rectangular range
V, v_i             attribute value
F, f_i             attribute frequency
S, s_i             attribute spread (s_i = v_{i+1} - v_i)
A, a_i             attribute area (a_i = f_i \cdot s_i)
h                  kernel bandwidth

Table 1. Symbol table.

2 Background

In this section we review the literature on histograms and briefly discuss the intuition behind the concepts used for the proposed methods: specifically, cubic splines and kernel density estimation.

2.1 Histograms

The time-honored histogram gives a (lossy) summarization of an attribute. It is constructed according to a partitioning rule for dividing the data into mutually disjoint bins. Each bin represents a range of the data. Associated with each bin is a single number denoting the frequency (count) of items occurring in the given range. Individual values within a bin are approximately reconstructed by making the uniform spread assumption, whereby values are assumed to be placed at equispaced intervals. Frequencies within a bin are approximately reconstructed by making the uniform frequency assumption, whereby all individual frequencies are assumed to be the same. Some traditional examples of histograms are equiwidth, where the bin boundaries are equispaced, and equidepth, where the bin boundaries are placed at quantiles. Following [16], histograms can be classified along three orthogonal axes: partition constraint, sort parameter, and source parameter. The partition constraint is the rule for assigning the data to mutually disjoint bins; the sort parameter, most typically attribute value (V) or frequency (F), is the parameter along which bins represent contiguous ranges; the source parameter, most typically the attribute spread (S), frequency (F), or area (A), is the parameter according to which the partition constraint is based. Table 1 lists these parameters along with their definitions.

Let p(s, u) denote a histogram with partition constraint p, sort parameter s, and source parameter u. Then equiwidth histograms can be written as equisum(V,S) and equidepth histograms as equisum(V,F). Of the histograms introduced in [16], we consider V-optimal-endbiased(F,F) and maxdiff(V,A). The V-optimal-endbiased(F,F) histogram stores singleton bins and approximates the frequency of the remaining bins with their average frequency. The maxdiff(V,A) histogram puts bin boundaries in between the β - 1 maximum consecutive (in sort parameter order) area differentials. Other histogram classes were introduced in [16] (e.g., compressed(V,F), compressed(V,A)) but are not considered in this paper for the reasons put forth in Sec. 4.
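To make the uniform frequency assumption concrete, the following C fragment (an illustrative sketch of ours, not code from any of the systems discussed; the type and function names are hypothetical) estimates the number of items falling in a query range [a, b] from an equiwidth histogram, prorating the two boundary buckets by the fraction of each bucket covered:

```c
/* Equiwidth histogram summary: nbins bucket counts over [min, min + nbins*width). */
typedef struct {
    double min, width;
    int nbins;
    const double *freq;    /* freq[i] = count of items in bucket i */
} EquiwidthHist;

/* Estimated count in [a, b]: fully covered buckets contribute their whole
 * count; partially covered buckets are prorated by the covered fraction,
 * i.e., the uniform frequency assumption. */
double hist_range_count(const EquiwidthHist *h, double a, double b)
{
    double total = 0.0;
    for (int i = 0; i < h->nbins; i++) {
        double lo = h->min + i * h->width;
        double hi = lo + h->width;
        double olo = (a > lo) ? a : lo;      /* overlap of [a, b] with bucket i */
        double ohi = (b < hi) ? b : hi;
        if (ohi > olo)
            total += h->freq[i] * (ohi - olo) / h->width;
    }
    return total;
}
```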

Figure 1. Illustration of a kernel density estimate, with data points indicated by tick marks, (a) for 2 data points, and (b) for 25 data points.

2.2 Cubic Splines

Splines are widely used for curve fitting in graphics and statistics. The most basic kind of spline is the cubic interpolation spline. Given a set of anchor points, called knots, a piecewise polynomial is constructed by joining each successive pair of knots with a separate cubic polynomial function beginning at one knot and extending to the other. See Fig. 3(c) for an example of a cubic spline. The cubic polynomials satisfy continuity conditions to make the spline continuous and twice differentiable, giving cubic splines a smooth appearance:

Definition 3 Given knots a = x_1 < x_2 < \cdots < x_\beta = b with values u_j at each knot, a cubic spline S is a function on [a, b] satisfying the following conditions:

1. S is a cubic polynomial, denoted S_j, on the interval [x_j, x_{j+1}] for 1 \leq j \leq \beta - 1, i.e., S_j(x) = a_j + b_j (x - x_j) + c_j (x - x_j)^2 + d_j (x - x_j)^3;
2. S(x_j) = u_j for each j;
3. S is twice continuously differentiable on [a, b].

There are many different, more sophisticated, types of splines, including regression splines and smoothing splines. Of these, perhaps the most well known is the regression B-spline [3]. In this paper, we only use cubic interpolation splines and their generalization in higher dimensions.

2.3 Kernel Density Estimation

Kernel estimation is one of the most popular methods in statistics for density estimation [20]. The basic idea is simple: for each data point X_i, a kernel (e.g., a Gaussian) centered about X_i is summed. The kernel is a smooth, symmetric, weighted function that smears out the probability in the neighborhood of the data point X_i. Figure 1 shows what a kernel estimate would look like (a) after 2 data points, and (b) after 25 data points have been summed. The data point values are indicated by tick marks on the x-axis.

Definition 4 Given data points X_1, X_2, \ldots, X_N, a kernel estimate \hat{f} is constructed as follows:

    \hat{f}(x) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - X_i)    (3)

where K_h(x) is usually a unimodal, symmetric and bounded density function depending on the bandwidth h. The most typical kernel, and the one used in our experiments, is a Gaussian, that is,

    K_h(x - X_i) = \frac{1}{\sqrt{2\pi}\, h}\, e^{-(x - X_i)^2 / (2h^2)}.    (4)

There is an inherent trade-off in choosing the bandwidth (standard deviation, in the case of the Gaussian) h: wide bandwidths will produce smooth estimates that may hide local features of the density; narrow bandwidths will create artifacts. Some kernel estimation algorithms involve finding the best bandwidth, requiring several iterations [19]. Instead, we use a heuristic common in statistics for choosing a good bandwidth, specifically,


    h = \frac{b - a}{4 (\log_2 N + 1)}    (5)

which allows the construction of a kernel estimate in one pass.

Figure 2. Two common cases where histograms make the uniform frequency assumption: (a) in two partially intersected bins, and (b) in the same bin.
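As an illustration of Eqs. 3-5 (our sketch, not the paper's implementation), the following C fragment computes a Gaussian kernel estimate at a fixed set of knot locations in a single pass over the data. It assumes the data range [lo, hi] is known in advance, as the bandwidth heuristic of Eq. 5 requires; the function name is hypothetical:

```c
#include <math.h>

/* One-pass Gaussian kernel density estimate (Eqs. 3-4), evaluated only at a
 * fixed set of knot locations -- exactly the quantity a KernelSpline stores. */
void kde_at_knots(const double *data, int n, double lo, double hi,
                  const double *knot_x, double *knot_y, int nknots)
{
    const double sqrt_two_pi = 2.5066282746310002;
    double h = (hi - lo) / (4.0 * (log2((double)n) + 1.0));  /* Eq. 5 */
    for (int j = 0; j < nknots; j++)
        knot_y[j] = 0.0;
    for (int i = 0; i < n; i++) {                /* single pass over the data */
        for (int j = 0; j < nknots; j++) {
            double z = (knot_x[j] - data[i]) / h;
            knot_y[j] += exp(-0.5 * z * z);      /* unnormalized Gaussian */
        }
    }
    for (int j = 0; j < nknots; j++)             /* normalize per Eqs. 3-4 */
        knot_y[j] /= (double)n * h * sqrt_two_pi;
}
```

Because each data point only adds its kernel contribution to the fixed knot locations, the estimate can also be maintained incrementally under insertions.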

2.4 Splinegrams

A splinegram is constructed by fitting a spline through the midpoints of an equiwidth histogram. The frequencies of each histogram bin at the bin's abscissa midpoint serve as spline knots. Splinegrams are similar to what is known in the statistics literature as the frequency polygon, both of which suffer from bias [17]. Figure 3(a) illustrates an example of a splinegram constructed from data that is normally distributed.

Recall that a histogram is a collection of bins and frequencies, where each bin represents a data range and its associated frequency summarizes the number of values that lie in the range. Two disadvantages come from the equivalence of a histogram to a piecewise constant function (when bin ranges are continuous). First, the discontinuities imposed by bin boundaries result in bin-dividers having a very local influence, i.e., incrementing the frequency of one bucket will not affect the frequencies of the other buckets. Second, range query estimation requires the uniform frequency assumption, that all attribute values in a bucket are assumed to have the same frequency [16]. It is the combination of these disadvantages that leads to estimation error accumulated at the boundary bins of a range query.

Figure 2 illustrates an example of a histogram summarization. Note the abrupt discontinuities between bins. Figure 2(a) shows a range interval [a, b] in which three histogram bins are completely intersected and two bins are partially intersected. The frequency in the ranges of the completely intersected bins can be determined exactly; however, the frequency must be approximated at the boundary bins. As mentioned in Sec. 2.1, histograms make the uniform frequency assumption to approximate a partial interval as a fraction of a complete interval. Figure 2(b) shows a range interval [a, b] properly contained inside a bin. For highly skewed distributions, which are common in real data sets, the uniformity assumption will suffer from bias due to the fact that an unweighted average is being used where a weighted average is called for.

Splines, on the other hand, are smooth because, by definition, they are polynomials in between knots and because they satisfy continuity conditions at the boundaries. The selectivity of partially intersected ranges can be estimated analytically via integration, without having to assume uniformity. In the next section, we propose two spline representations in which the influence of the knot locations is inherently non-local due to their continuous underpinnings.

3 Proposed Methods

We propose two novel methods involving splines for selectivity estimation: KernelSplines and OptimalSplines. In contrast to histograms and splinegrams, these methods are based on a completely different paradigm that is tailored to continuous attributes. The main idea behind these methods is to assume that the data is generated by a continuous process, and to estimate the underlying p.d.f. using a smooth nonparametric technique. These nonparametric methods have a faster asymptotic rate of convergence of the mean square error than histograms: O(N^{-4/5}), compared to O(N^{-2/3}), where N is the number of data points [20]. We hope to exploit this in our methods, both of which are explained in detail below.

3.1 KernelSplines

Given a set of knot locations at x_1 < x_2 < \cdots < x_\beta, a KernelSpline runs a kernel estimator (see Sec. 2.3) to estimate the density at each knot location; these estimates are the y-values of the knots. A cubic spline is then constructed through these knots to approximate the underlying kernel estimate. Thus, KernelSplines give a compact approximation of the p.d.f. Because a kernel density estimate can be obtained in a single pass, the KernelSpline requires one pass to build. Furthermore, a KernelSpline can be maintained incrementally, with no need for periodic re-builds, if the knot locations are fixed (for example, if the knots are equispaced). Note that this takes the same time that it takes to build and maintain an equiwidth histogram. Figure 3(b) illustrates a KernelSpline along with the density estimate on which it is based.

3.2 OptimalSplines

Given a set of knot locations at x_1 < x_2 < \cdots < x_\beta, an OptimalSpline converges towards optimal coefficients for approximating the density with B-spline basis functions via maximum likelihood estimation. The MLE iteration is performed in main memory on a sample of the data set (see [10] for details). Figure 3(c) illustrates an OptimalSpline.

3.3 Estimating Result Sizes

Our methods require on-the-fly construction of a cubic spline when estimating a range selectivity from the knots that are stored. As the experiment mentioned in Sec. 4.2 indicates, this extra CPU overhead compared to histograms is practically negligible. The reason is that computing cubic spline coefficients is cheap. In fact, it involves solving a tridiagonal system in time linear in the number of buckets β. Once the spline is constructed, range selectivity estimation involves calculating an analytic integral in each interval. This requires roughly the same CPU time as summing up histogram bucket frequencies.
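These two steps can be sketched in C as follows (our sketch, not the paper's code; natural boundary conditions are an assumption, since the paper does not specify how the spline ends are constrained). spline_coeffs obtains the knots' second derivatives by solving the tridiagonal system in linear time; spline_integral then evaluates the analytic integral of the resulting piecewise cubic over a query range, prorating nothing:

```c
#include <stdlib.h>

/* Natural cubic interpolation spline through (x[i], y[i]), i = 0..n-1:
 * compute second derivatives y2[] by an O(n) tridiagonal solve. */
void spline_coeffs(const double *x, const double *y, int n, double *y2)
{
    double *u = malloc(n * sizeof *u);
    y2[0] = u[0] = 0.0;                      /* natural boundary conditions */
    for (int i = 1; i < n - 1; i++) {
        double sig = (x[i] - x[i-1]) / (x[i+1] - x[i-1]);
        double p = sig * y2[i-1] + 2.0;
        y2[i] = (sig - 1.0) / p;
        u[i] = (y[i+1] - y[i]) / (x[i+1] - x[i])
             - (y[i] - y[i-1]) / (x[i] - x[i-1]);
        u[i] = (6.0 * u[i] / (x[i+1] - x[i-1]) - sig * u[i-1]) / p;
    }
    y2[n-1] = 0.0;
    for (int i = n - 2; i >= 0; i--)         /* back-substitution */
        y2[i] = y2[i] * y2[i+1] + u[i];
    free(u);
}

/* Analytic integral of the spline over [ql, qr], clipped to the knot range.
 * Per interval, S(x) = A*y[j] + B*y[j+1] + ((A^3-A)*y2[j] + (B^3-B)*y2[j+1])*h^2/6
 * with A = (x[j+1]-x)/h, B = 1-A, integrates in closed form. */
double spline_integral(const double *x, const double *y, const double *y2,
                       int n, double ql, double qr)
{
    double total = 0.0;
    if (ql < x[0])   ql = x[0];
    if (qr > x[n-1]) qr = x[n-1];
    for (int j = 0; j < n - 1 && ql < qr; j++) {
        double xl = (ql > x[j])   ? ql : x[j];     /* overlap with interval j */
        double xr = (qr < x[j+1]) ? qr : x[j+1];
        if (xr <= xl) continue;
        double h  = x[j+1] - x[j];
        double Al = (x[j+1] - xl) / h, Ar = (x[j+1] - xr) / h;
        double Bl = (xl - x[j]) / h,   Br = (xr - x[j]) / h;
        total += 0.5 * h * (y[j] * (Al*Al - Ar*Ar) + y[j+1] * (Br*Br - Bl*Bl))
               + (h*h*h / 6.0) *
                 ( y2[j]   * (0.25*(Al*Al*Al*Al - Ar*Ar*Ar*Ar) - 0.5*(Al*Al - Ar*Ar))
                 + y2[j+1] * (0.25*(Br*Br*Br*Br - Bl*Bl*Bl*Bl) - 0.5*(Br*Br - Bl*Bl)) );
    }
    return total;
}
```

If the spline approximates a p.d.f., as with KernelSplines, the integral itself is the selectivity estimate; multiplying by the number of data points N gives an estimated result size.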

4 Experiments

Here we compare histograms with the methods proposed in Section 3, namely, the KernelSpline and the OptimalSpline. Section 4.1 discusses the results from estimation accuracy experiments. Section 4.2 discusses the runtime CPU costs of the methods.

Competing Methods: Following [16], fair comparisons between the methods are ensured by constructing them so that they occupy the same amount of space: approximately 160 bytes. Table 2 summarizes the space requirements of the methods. Equiwidth histograms group contiguous ranges of attribute values into buckets at equally-spaced abscissas; equiknot splines are computed piecewise at equally-spaced knot locations. Both equispaced methods are incrementally maintainable. We do not consider equidepth histograms because they require multiple passes to determine quantiles (e.g., by sorting) and are not incrementally maintainable, which is inefficient for large data sets. We do, however, consider histogram strategies proposed in [16] because, though not incrementally maintainable, they can be built in a single pass. In [16], the authors suggest that maxdiff(V,A) is the best overall histogram method for discrete or categorical data. However, maxdiff(V,A) turns out not to be the method of choice for continuous attributes, since f_i = 1 for practically all i. To test this intuition, we implemented maxdiff(V,A) and found that it did not perform as well as equiwidth histograms in every single case we tried. For the same reason, many of the other methods in [16] are not well suited to our problem, in particular, those which use V as the sort parameter and either F or A as the source parameter.

method           storage description                               bytes
histograms
  equiwidth      freqs + min + interval                            β + 2
  eb-continuous  singleton-coords + min + interval + avgfreq       2(β/2) + 1
splines
  equispaced     knot heights + min + interval                     β + 2
  end-biased     singleton-coords + min + interval + avgfreq       2(β/2) + 1

Table 2. Required space for the competing methods, where min indicates the minimum value, interval the distance between abscissas, and avgfreq the average frequency of the remaining bins.
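For intuition on the accounting in Table 2, a hypothetical C layout of the two equispaced summaries follows (our sketch; the paper does not give a concrete representation, and BETA and the struct names are assumptions):

```c
/* With 8-byte doubles and BETA = 20, each summary occupies
 * (BETA + 2) * 8 = 176 bytes -- roughly the 160-byte budget used here. */
enum { BETA = 20 };

typedef struct {
    double min;              /* smallest attribute value            */
    double interval;         /* distance between bucket abscissas   */
    double freq[BETA];       /* equiwidth histogram: bucket counts  */
} EquiwidthSummary;

typedef struct {
    double min;              /* first knot location                 */
    double interval;         /* distance between knots              */
    double height[BETA];     /* equispaced spline: knot heights     */
} EquispacedSplineSummary;
```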


Figure 3. Three spline-based methods: (a) splinegram (with its associated histogram bins), (b) KernelSpline (with its associated kernel estimate), and (c) OptimalSpline.

Of the recent work on histogram design, we chose to compare against (a variation on) V-optimal-endbiased(F,F) (henceforth called end-biased) because (a) it is the most readily extended to continuous data by way of an initial grouping of the data, and (b) it was reported in previous work to be the best [9]. We do not use the end-biased histogram directly; instead, we first group the data into equiwidth bins to produce frequency counts and then, using these frequencies as the sort parameter, we bucket similar frequencies together to minimize the variance of frequencies in each bin, just as with end-biased. We call this technique eb-continuous, because continuous data requires the initial grouping step. In the parlance of [16], this method would be called V-Optimal-End-Biased(Equisum(V,S),F). We compared the eb-continuous histogram to spline methods where knots are placed at the highest/lowest y-values, with an average y-value given for the remaining knots, such that the variance is minimized à la end-biased histograms. In summary, we compared (4 methods) × (2 binnings).

Software: The histogram methods were implemented in C. The density command from Splus was used for computing kernel estimates. Code for B-spline basis generation by MLE, which is the basis for OptimalSpline build-up, can be found in Statlib under the name logspline. Knot/bin locations given by the methods were sent to a basic cubic spline algorithm implemented in C during online size estimation.

Queries: Queries were carefully chosen in the 1-d interval [a, b] so that they resulted in uniform selectivities (between 0-100%), low selectivities (0-20%), and high selectivities (80-100%).

Data sets: Three real data sets were used: worldnet, usage data from a random sample of 100,000 AT&T WorldNet users; thyroid, thyroid medical data from 7,200 patients; and cloud, cloud data over 100,000 recorded intervals. (thyroid is from http://www.ics.uci.edu/~mlearn/MLSummary.html; cloud is from http://cdiac.esd.ornl.gov/cdiac/ndps/ndp026b.html.)

Error Measure: Following [16], the error E of selectivity estimates for a set Q of N queries is computed as

    E = \frac{100}{N} \sum_{q \in Q} \frac{|S(q) - S'(q)|}{S(q)}    (6)

where S(q) and S'(q) are, respectively, the actual and estimated size of the query result. Since we are interested in range queries, the selectivity estimate is the integral of the estimated p.d.f. over the specified range.
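For concreteness, Eq. 6 in C (a trivial sketch of ours; the function name is illustrative):

```c
#include <math.h>

/* Average relative estimation error, in percent (Eq. 6). */
double error_measure(const double *actual, const double *estimate, int nq)
{
    double sum = 0.0;
    for (int q = 0; q < nq; q++)
        sum += fabs(actual[q] - estimate[q]) / actual[q];
    return 100.0 * sum / nq;
}
```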

4.1 Estimation Accuracy

We averaged the estimation errors of the histogram and spline methods over 1000 queries. Table 3 summarizes the results for the worldnet, thyroid, and cloud data sets. For brevity, we present only the results of range queries where the selectivity is uniformly chosen between 0-100% and omit high (80-100%) and low (0-20%) selectivity queries, as their relative results were similar.

As Table 3(a) shows, the eb-continuous (end-biased) knot (bin) placement did not increase the accuracy much on the worldnet data set, and even performed significantly worse in the case of the OptimalSpline. This suggests that the goal of minimizing within-bucket frequency variances is implicitly better suited for discrete data than for continuous data.

As expected, the OptimalSpline consistently achieved the best results, equispaced and eb-continuous, for all of the data sets. If incremental maintenance is not needed, then the OptimalSpline is the technique of choice. The KernelSpline gave result sizes that were slightly (5-11%) more accurate than the equiwidth histogram. As reported in the following section, there is no performance price to pay for using the KernelSpline in place of the equiwidth histogram.

Recall that splinegrams are constructed from histogram bin midpoints and, therefore, contain knots that are only locally influential. As such, this method suffers from bias [17]. Because the construction of splinegrams does not properly make use of the underlying continuity that the proposed methods do, we were not surprised to discover how poorly they performed relative to the proposed methods, generating more than 22 times the error of OptimalSpline in one experiment. The lesson here is that one cannot just use splines arbitrarily to achieve good accuracy; rather, the nonparametric method underlying spline construction is critical.

Finally, we considered the effect of storage space on accuracy for the histogram and spline methods at different values of β. We found that the accuracy improved proportionally with increasing storage space for all the methods.

method          equispaced   eb-continuous
histogram       55.44%       51.46%
splinegram      128.90%      130.67%
KernelSpline    39.75%       37.52%
OptimalSpline   16.83%       36.63%
(a) worldnet

method          equispaced   eb-continuous
histogram       80.95%       62.93%
splinegram      100.35%      92.99%
KernelSpline    63.98%       48.98%
OptimalSpline   14.53%       26.21%
(b) thyroid

method          equispaced   eb-continuous
histogram       41.92%       32.11%
splinegram      93.94%       76.31%
KernelSpline    36.69%       37.51%
OptimalSpline   31.02%       35.69%
(c) cloud

Table 3. Estimation errors for histograms, splinegrams, KernelSplines, and OptimalSplines, for (a) worldnet, (b) thyroid, and (c) cloud data sets.

4.2 Estimation Speed

We compared the overhead of estimating the selectivity at runtime for both histograms and the proposed spline methods, with β = 20. We ran these experiments on each of the data sets listed above and measured the user time on an SGI workstation running IRIX, and found that it took approximately 18 seconds for all methods to compute estimates for 1000 queries. Furthermore, we observed approximately linear scale-up of runtime overhead with increasing storage space.

5 Multivariate Range Query Estimation

In multivariate histograms, the effect of bin discontinuities on estimation error is amplified. Figure 4(a) compares a bivariate histogram to a bivariate KernelSpline built from the same multivariate normal data set. Note how much discontinuity is imposed by the histogram bin dividers. The uniform frequency assumption is also more of an issue for multivariate data because, unlike in a univariate histogram, it is possible for more than two bins (in fact, arbitrarily many) to be partially intersected. In fact, high dimensional range queries will exhibit yet another manifestation of the dimensionality curse, whereby (on average) exponentially more buckets will be partially intersected. Figure 5(a) presents a bird's eye view of equiwidth bins and a square range query in the range [a, b] × [c, d]. In this case, the uniformity assumption is made in all 12 bins that are partially intersected.

In the following subsections, we consider how to extend the proposed methods of Sec. 3 to multivariate attributes. For brevity, we focus on the bivariate KernelSpline. Section 5.1 describes how to extend the univariate cubic spline to bivariate attributes; Section 5.2 shows how to do so for the KernelSpline; Section 5.3 presents some experimental results of this method compared to bivariate histograms.

5.1 Bicubic Splines

The bicubic interpolation spline is a generalization of the cubic interpolation spline to surfaces. See Fig. 4(b) for an example of a bicubic spline. Here we give a more formal definition:

Figure 4. Nonparametric estimation of a bivariate normal density (a) using a bivariate histogram, and (b) using a bicubic spline.

Figure 5. Two common cases where histograms make the uniform frequency assumption: (a) in two partially intersected bins, and (b) in the same bin.

Definition 5 Given a rectangular grid G = \{(x_i, y_j) \mid a = x_0 < x_1 < \cdots < x_n = b,\ c = y_0 < y_1 < \cdots < y_m = d\} with heights u_{ij} at each knot, a bicubic spline S is a function on R = [a, b] × [c, d] satisfying the following conditions:

1. S is a bicubic polynomial, denoted S_{ij}, on each rectangle [x_i, x_{i+1}] × [y_j, y_{j+1}], i.e., S_{ij}(x, y) = \sum_{k=0}^{3} \sum_{s=0}^{3} a_{ijks} (x - x_i)^k (y - y_j)^s;
2. S(x_i, y_j) = u_{ij} for each i, j;
3. S is twice continuously differentiable on R.

5.2 Bivariate KernelSplines

Given a rectangular grid of knots, a KernelSpline estimates the bivariate density at each knot location via bivariate kernel density estimation. Bivariate kernel estimation is very similar to univariate density estimation, but with the sum of bivariate rather than univariate Gaussian kernels gauged at the grid points. A bicubic spline is then constructed through these grid points. Figure 4(b) illustrates an example. As in the univariate case, the bivariate KernelSpline can be built in one pass over the data set and is incrementally maintainable. Its runtime overhead involves computing a bicubic spline on the fly, which can be done in O(nm) time by solving m tridiagonal systems of diagonal length n.

5.3 Experiments

Competing Methods: As a representative of the histogram methods, we selected the bivariate equiwidth histogram; as a representative of the proposed spline methods, we selected the bivariate KernelSpline at evenly-spaced knots on the rectangular grid. We compared these two methods at the same storage space: approximately 400 bytes.

Software: The bivariate histogram methods were implemented in C. Freely available Splus code from Statlib was used for computing bivariate kernel estimates. Knot locations given by bivariate histograms and kernel estimates were sent to a basic bicubic spline algorithm implemented in C for KernelSplines.

Queries: Queries for the bivariate methods were in the form of 2-d rectangular ranges [a, b] × [c, d]. These ranges were chosen uniformly.

Data sets: Two data sets were used: binormal, 100,000 bivariate normally distributed points, and LBcounty, Cartesian coordinates of 63,830 road crossings in Long Beach County, CA.

Error Measure: The error measure from Sec. 4 was used.

Results: We ran our method on the above data sets for 1000 queries each. Our results show an even more favorable ratio of performance for KernelSplines than what was observed with univariate data: KernelSplines have less than half the error of equiwidth histograms for the binormal data set and less than one third for LBcounty. Table 4 summarizes the results. One would expect bivariate OptimalSplines to outperform bivariate KernelSplines substantially, based on the relative performance of univariate OptimalSplines to KernelSplines, but this remains to be tested.

method         equispaced
histogram      21.55%
KernelSpline   9.28%
(a) binormal

method         equispaced
histogram      16.86%
KernelSpline   5.08%
(b) LBcounty

Table 4. Estimation errors of the equiwidth histogram and the KernelSpline, for (a) binormal, and (b) LBcounty.
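As a sketch of the build step of Sec. 5.2 (ours, not the paper's code; the paper does not pin down the exact bivariate kernel, so a Gaussian product kernel with per-dimension bandwidths hx and hy is an assumption, as are the names), the following C fragment computes, in one pass, the grid heights through which the bicubic spline is fit:

```c
#include <math.h>

/* One-pass bivariate Gaussian kernel density estimate evaluated at the knots
 * of an nx-by-ny rectangular grid -- the heights a bivariate KernelSpline
 * stores. Uses a product kernel with bandwidths hx and hy. */
void kde2_at_grid(const double *xs, const double *ys, int n,
                  const double *gx, int nx, const double *gy, int ny,
                  double hx, double hy, double *height /* nx*ny, row-major */)
{
    const double two_pi = 6.283185307179586;
    for (int k = 0; k < nx * ny; k++)
        height[k] = 0.0;
    for (int i = 0; i < n; i++) {             /* single pass over the data */
        for (int a = 0; a < nx; a++) {
            double zx = (gx[a] - xs[i]) / hx;
            double wx = exp(-0.5 * zx * zx);
            for (int b = 0; b < ny; b++) {
                double zy = (gy[b] - ys[i]) / hy;
                height[a * ny + b] += wx * exp(-0.5 * zy * zy);
            }
        }
    }
    for (int k = 0; k < nx * ny; k++)         /* normalize to a density */
        height[k] /= (double)n * two_pi * hx * hy;
}
```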

6 Related Work

As mentioned, the focus of the past work is on discrete data, where at least some of the attribute values have high multiplicity. Most methods in the literature successfully exploit this assumption, such as the end-biased histogram [9]; the maxdiff histogram [16, 15]; the compressed histogram [16, 15]; the polynomial-based method in [18]; and wavelet-based histograms [13]. As shown in [16], the prevailing method is maxdiff(V,A), which we used in our experiments. Remotely related to our work are the query feedback approach of [1]; the use of linear regularization to obtain better estimates from histograms [4]; and the CF-kernel method of [12] to obtain a fast kernel estimation of the density in very large data sets.

7 Conclusions

The main contribution of this paper is the recognition of the need to distinguish between different attribute value domain types, such as discrete and continuous, for obtaining range query selectivity estimates and for approximate query answering over continuous valued attributes. Our analysis and experiments reveal that continuous valued attributes call for different statistical profiling methods. We have presented two methods based on splines, namely, KernelSplines and OptimalSplines, which have the following benefits:

- They exploit the continuity of the data by fitting analytic continuous functions to a smooth estimate of the data density (p.d.f.);
- They can estimate range selectivities by evaluating an analytic integral, eliminating the need for the uniform frequency assumption, which is known to deteriorate the performance of histograms;
- They can be implemented quickly using public domain code for Splus.

Due to these advantages, our experiments on both real and synthetic data have demonstrated the following:

- They achieve substantially lower estimation error (as low as one-sixth) than the state-of-the-art histograms, at exactly the same amount of storage space and at comparable CPU time;
- The relative estimation error of the proposed methods compared to histograms is amplified for multivariate range estimations.

For these reasons, the proposed spline methods are more attractive than histograms for evaluating range selectivities over univariate and multivariate continuous attributes. The OptimalSpline is the most accurate for univariate attributes, consistently achieving the lowest estimation error out of all the methods examined, and obtaining excellent results on a variety of data sets (e.g., 14.5% error on a 100K-attribute highly-skewed data set). In the case where periodic offline build-up of a statistical profile is required, the KernelSpline shares some of the nice properties of the OptimalSpline, generating better results than histograms, but requiring only one pass when the data range is known in advance. For multivariate attributes, we have observed that the KernelSpline is more clearly a better choice than the histogram, as the differential in estimation error compared to histograms is magnified in higher dimensions. One could infer that the ratio should be even better for the multivariate OptimalSpline; nonetheless, we would recommend the KernelSpline as the method of choice for multivariate attributes, where build-up time is more expensive, because it clearly beats histograms while only requiring a single pass over the data.

Future work includes investigating the relative degradation and maintenance work requirements of the proposed methods compared to histograms in the presence of insertions and deletions.

Acknowledgments

We would like to thank Christos Faloutsos for his useful comments. We would also like to thank statisticians Andreas Buja, William DuMouchel, and Charles Kooperberg for their constructive discussions.

References

[1] Chungmin M. Chen and Nick Roussopoulos. Adaptive selectivity estimation using query feedback. In Proc. ACM SIGMOD, pages 161-172, Minneapolis, MN, May 1994.

[2] S. Christodoulakis. Estimating block selectivities. Information Systems, 9(1), March 1984.

[3] C. de Boor. Spline functions. In S. Kotz and N. Johnson, editors, Encyclopedia of Statistical Sciences, volume 2. Wiley, 1983.

[4] Christos Faloutsos, H.V. Jagadish, and Nikolaos D. Sidiropoulos. Recovering information from summary data. In Proc. of VLDB, pages 36-45, Athens, Greece, August 1997.

[5] Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, and Lynne Stokes. Sampling-based estimation of the number of distinct values of an attribute. In Proc. of VLDB, pages 311-322, September 1995.

[6] Joseph M. Hellerstein, Peter J. Haas, and Helen J. Wang. Online aggregation. In Proc. ACM SIGMOD, pages 171-182, Tucson, AZ, May 1997.

[7] Y. Ioannidis. Universality of serial histograms. In Proc. of VLDB, pages 256-277, Dublin, Ireland, August 1993.

[8] Y. Ioannidis and S. Christodoulakis. Optimal histograms for limiting worst-case error propagation in the size of join results. ACM Transactions on Database Systems, 18(4):709-748, December 1993.

[9] Yannis E. Ioannidis and Viswanath Poosala. Balancing histogram optimality and practicality for query result size estimation. In Proc. ACM SIGMOD, pages 233-244, San Jose, CA, June 1995.

[10] C. Kooperberg and C. Stone. Logspline density estimation for censored data. Journal of Computational and Graphical Statistics, December 1992.

[11] Flip Korn, H.V. Jagadish, and Christos Faloutsos. Efficiently supporting ad hoc queries in large datasets of time sequences. In Proc. ACM SIGMOD, pages 289-300, Tucson, AZ, May 1997.

[12] M. Livny, R. Ramakrishnan, and T. Zhang. Fast density and probability estimation using the CF-kernel method for very large databases. Technical report, University of Wisconsin, Madison, WI, July 1996.

[13] Yossi Matias, Jeff Vitter, and Min Wang. Wavelet-based histograms for selectivity estimation. In Proc. ACM SIGMOD, pages 448-459, Seattle, WA, June 1998.

[14] M. Muralikrishna and David J. DeWitt. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Proc. ACM SIGMOD, pages 28-36, Chicago, IL, June 1988.

[15] Viswanath Poosala and Yannis E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proc. of VLDB, pages 486-495, Athens, Greece, August 1997.

[16] Viswanath Poosala, Yannis E. Ioannidis, Peter J. Haas, and Eugene J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proc. ACM SIGMOD, pages 294-305, Montreal, Canada, June 1996.

[17] David Scott. Multivariate Density Estimation. Wiley, New York, 1992.

[18] Wei Sun, Yibei Ling, Naphtali Rishe, and Yi Deng. An instant and accurate size estimation method for joins and selection in a retrieval-intensive environment. In Proc. ACM SIGMOD, pages 79-88, May 1993.

[19] E. J. Wegman. Nonparametric probability density estimation: A summary of available methods. Technometrics, 14(3):533-545, August 1972.

[20] E. J. Wegman. Density estimation. In S. Kotz and N. Johnson, editors, Encyclopedia of Statistical Sciences, volume 2. Wiley, 1983.
