
Database Architecture Evaluation: Mammals are Getting Better and Better [11]

Kadir Akbudak, Erkan Okuyan


Computer Engineering Department, Bilkent University, Ankara, Turkey Email: {kadir, eokuyan}@cs.bilkent.edu.tr

Abstract: Index compression has received great attention in many areas of computer science. Compressing inverted indexes is a well-studied subject in information retrieval. In the data management domain, compressing bitmap indexes is of great importance. Both the indexes and the objectives behind compressing them are similar. In this work, we aim to show that the models used in compression of bitmap and inverted indexes are similar. Then, the applicability of compression methods to each other is discussed, considering several works in the literature of the two domains. Furthermore, the applicability of methods proposed for these two problems is investigated in the context of a column-store based DBMS.

I. INTRODUCTION

Bitmap compression and inverted index compression via the document-gap (d-gap) compression scheme are closely related problems. Although the two problems differ in that their objectives are slightly different, to facilitate compression calculation, methods attacking either problem seem applicable to the other. Furthermore, the column-store architecture for DBMSs appears to contain the same set of problems as bitmap compression and inverted index compression.

The paper is organized as follows: We present some preliminaries to introduce the d-gap compression scheme in Subsection II-A. Then, the TSP formulation for d-gap compression of inverted indices is given in Subsection II-B, and this formulation is compared to the TSP formulation for bitmap compression in Subsection II-C. Then, the column-store architecture for DBMSs is introduced and the applicability of the proposed methods to this application is discussed in Section III.

II. INVERTED INDEX AND BITMAP COMPRESSION PROBLEMS

In this section, we present and relate the inverted index compression and bitmap compression problems.

A. Inverted Index Compression

Inverted indices are the most popular indexing mechanism for document search in an information retrieval system [14]. Furthermore, they achieve considerably better query response times and functionality than alternative data structures [19]. However, one problem with inverted indices still needs attention: considering the size of the web, inverted indices can be quite large, and this may cause problems for storage and query response times. Inverted index compression addresses this problem [15], [5], [7]. Briefly, we will show the structure of an inverted index and then show how compression can be achieved.

1) Inverted Index Structure: In an inverted index, there is a list of document identifiers for every term present in the index. This list of document identifiers is also referred to as the posting list. The structure of a posting list is shown below:

< t; f_t; d_1, d_2, d_3, ..., d_{f_t} >    (1)

In this list, t denotes the term identifier, f_t is the total number of documents in which term t appears, and the d_i are the identifiers of the documents in which term t appears.
Since we store these posting lists for every term in the dataset and the number of documents can be quite large, the sizes of inverted indices can also be quite large.

2) Compressing Posting Lists: The d-gap compression scheme is a popular compression scheme for inverted indices. The idea is to exploit the ability to encode smaller numbers with fewer bits. There are algorithms to store small numbers with few bits, the most notable being the gamma code, the delta code, and the Golomb code [17]. It is important to note that the bulk of research on inverted index compression tries to improve encoding techniques, so this is a well-established area with a well-documented literature. Since it is possible to store small numbers with fewer bits using prefix-free codes, the main aim is to produce a posting list of smaller numbers that stores the same information as the original posting list. The d-gap representation achieves this goal and works as follows [17], [18], [12]:
1) Sort the document identifiers of each posting list in increasing order.
2) Replace each document identifier with the difference between it and its predecessor to form a list of d-gaps (difference values), where the first document identifier stays the same.
The motivation for using differences of document identifiers (d-gaps) instead of the identifiers themselves comes from the observation that d-gaps are generally much smaller than document identifiers. An example execution is shown

below:

Original: < t_i; 7; 15, 43, 90, 8, 51, 130, 61 >
Sorted:   < t_i; 7; 8, 15, 43, 51, 61, 90, 130 >
d-gap:    < t_i; 7; 8, 7, 28, 8, 10, 29, 40 >
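The two-step procedure and the worked example above can be sketched in Python (the paper itself presents no code; this is purely illustrative), together with the Elias gamma code as one representative prefix-free code:

```python
def to_dgaps(postings):
    """Sort document identifiers, then replace each one (except the
    first) with the difference from its predecessor."""
    s = sorted(postings)
    return [s[0]] + [b - a for a, b in zip(s, s[1:])]

def from_dgaps(dgaps):
    """Recover the sorted posting list via prefix sums."""
    out, acc = [], 0
    for g in dgaps:
        acc += g
        out.append(acc)
    return out

def elias_gamma(n):
    """Elias gamma code: a unary length prefix followed by the binary
    representation. Small numbers get short codewords, which is why
    producing small d-gaps pays off."""
    b = bin(n)[2:]               # binary representation without '0b'
    return "0" * (len(b) - 1) + b

postings = [15, 43, 90, 8, 51, 130, 61]
dgaps = to_dgaps(postings)       # [8, 7, 28, 8, 10, 29, 40]
assert from_dgaps(dgaps) == sorted(postings)
```

Running this on the example posting list reproduces the d-gap line above.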

Although d-gap compression of posting lists works fine, the d-gaps in a posting list may also fluctuate a lot (which affects compression performance). To avoid this problem and consistently achieve low d-gaps, a TSP formulation can be used [15]. A model for this problem is explained in the next subsection.

B. Identifier Reassignment for Inverted Index Compression

Two documents that deal with similar topics or belong to a specific domain usually share a large number of common terms. This, in turn, means the document identifiers of similar documents appear in many posting lists corresponding to their common terms. We can then expect that, if we assign closer identifiers to similar documents, the gap values appearing in the posting lists of common terms will be smaller. This is achieved by finding a permutation π of document identifiers, where π(i) is the new identifier for document d_i. For example, suppose we have the following documents in our dataset:

d_1 = {cold, weather, rain, forecast}
d_2 = {economy}
d_3 = {hot, weather, rain, forecast}

The posting lists induced by the original identifier assignment are:

< cold; 1; 1 >  < hot; 1; 3 >  < weather; 2; 1, 3 >  < rain; 2; 1, 3 >  < forecast; 2; 1, 3 >  < economy; 1; 2 >

If π = (1, 3, 2), the posting lists become:

< cold; 1; 1 >  < hot; 1; 2 >  < weather; 2; 1, 2 >  < rain; 2; 1, 2 >  < forecast; 2; 1, 2 >  < economy; 1; 3 >

The model presented in the following subsection, which reassigns document identifiers to reduce the average gap values in posting lists, is based on a TSP formulation. Improvements of 15% on d-gap compressed inverted indices are reported [15].
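The effect of identifier reassignment on the sum of d-gap values can be checked with a small sketch (illustrative code, not from the paper); the documents and the permutation are those of the example above:

```python
def posting_lists(docs):
    """Build term -> sorted document-identifier posting lists.
    docs maps a document identifier to its set of terms."""
    lists = {}
    for d, terms in docs.items():
        for t in terms:
            lists.setdefault(t, []).append(d)
    return {t: sorted(ds) for t, ds in lists.items()}

def dgap_sum(lists):
    """Sum of all d-gap values over every posting list (the first
    identifier of each list counts as its own gap)."""
    total = 0
    for ds in lists.values():
        total += ds[0] + sum(b - a for a, b in zip(ds, ds[1:]))
    return total

docs = {1: {"cold", "weather", "rain", "forecast"},
        2: {"economy"},
        3: {"hot", "weather", "rain", "forecast"}}
pi = {1: 1, 2: 3, 3: 2}   # the permutation from the example
renamed = {pi[d]: terms for d, terms in docs.items()}

before = dgap_sum(posting_lists(docs))    # 15
after = dgap_sum(posting_lists(renamed))  # 12
assert after < before
```

Assigning adjacent identifiers to the two similar documents (d_1 and d_3) reduces the total gap sum.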

1) Document Similarity Graph: Since the general working principle is assigning closer document identifiers to similar documents, a metric of similarity has to be defined. The application benefits from assigning close document identifiers to documents that happen to be present in many posting lists together (in other words, that have many common terms); thus, similarity in this application is defined as the number of common terms between two documents. We can then construct a document similarity graph (DSG) where vertices represent documents and edge weights represent the similarity between the two documents they connect. It is important to note that the DSG is a complete graph, since the similarity between every pair of documents has to be calculated and represented in the DSG. An example dataset and the corresponding DSG can be seen in Figure 1.
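A DSG as defined above can be built directly from the documents' term sets; this is an illustrative sketch rather than code from the paper:

```python
from itertools import combinations

def document_similarity_graph(docs):
    """Edge weight between two documents = number of common terms.
    The DSG is complete, so every pair gets an entry (possibly 0)."""
    return {(a, b): len(docs[a] & docs[b])
            for a, b in combinations(sorted(docs), 2)}

docs = {1: {"cold", "weather", "rain", "forecast"},
        2: {"economy"},
        3: {"hot", "weather", "rain", "forecast"}}
dsg = document_similarity_graph(docs)
# {(1, 2): 0, (1, 3): 3, (2, 3): 0}
```

Documents 1 and 3 share three terms (weather, rain, forecast), so their edge carries the highest weight.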

Fig. 1. An example document similarity graph

2) Permuting Document Identifiers: We need to find a new permutation of document identifiers that minimizes the sum of d-gap values. This effectively corresponds to finding a traversal of the DSG where the sum of the similarities between traversed documents is maximal. In other words, in a traversal of the DSG, find a new permutation of documents such that the sum of the similarities between each document and its predecessor is maximal. This problem can be formulated as an instance of TSP. One key difference is that TSP finds a minimal path while our problem requires a maximal path. Thus, a modification of the edge weights in the DSG is required to be able to run a TSP algorithm on the DSG. This can be performed by setting the weight of each edge to the weight of the maximum-weighted edge minus its own weight. An example giving an intuitive understanding of the TSP-based solution to the document identifier problem is presented in Figure 2.

C. Relationship between Bitmap Compression and Inverted Index Compression

In this section we will try to show how closely related these two problems are.

1) Document Similarity Matrix and Term-by-Document Matrix: The DSG can also be represented as a matrix, which will be referred to as the document similarity matrix (DSM). An example is shown in Figure 3. Another (more detailed) representation of documents and the terms they contain is the term-by-document matrix (TDM). An example TDM is shown in Figure 4.
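The weight modification and a cheap stand-in for a TSP solver (a nearest-neighbor heuristic, used here only for illustration; real studies would use a stronger TSP algorithm) can be sketched as:

```python
def to_minimization(dsg):
    """Convert the maximum-path objective into a minimum-path one by
    replacing each weight w with w_max - w. Since w_max is a constant,
    a minimal tour under the new weights is maximal under the old."""
    w_max = max(dsg.values())
    return {e: w_max - w for e, w in dsg.items()}

def nearest_neighbor_tour(vertices, cost):
    """Greedy tour: start at the first vertex and repeatedly visit
    the cheapest unvisited neighbor."""
    def c(a, b):
        return cost[(a, b) if (a, b) in cost else (b, a)]
    tour = [vertices[0]]
    rest = set(vertices[1:])
    while rest:
        nxt = min(rest, key=lambda v: c(tour[-1], v))
        tour.append(nxt)
        rest.remove(nxt)
    return tour

dsg = {(1, 2): 0, (1, 3): 3, (2, 3): 0}   # the example DSG
tour = nearest_neighbor_tour([1, 2, 3], to_minimization(dsg))
# the similar documents 1 and 3 end up adjacent: [1, 3, 2]
```

The resulting visit order is exactly the permutation π = (1, 3, 2) used in the earlier example.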

Fig. 4. A term-by-document matrix (TDM) on the left and, on the right, the document similarity matrix (DSM) generated from the TDM

Fig. 2. An example TSP-based solution to the document identifier problem

Fig. 3. A document similarity graph (DSG) and its corresponding matrix representation, also called a document similarity matrix (DSM)

Constructing the TDM and DSM from an inverted index is always possible. It should be noted that it is easy to construct the DSM from the TDM; the reverse is not true, however. Nevertheless, it is worth noting that they contain much the same information in terms of similarities between documents.

2) Bitmap Compression and Inverted Index Compression: Although we won't go into much detail (explaining TSP for bitmap compression is out of the scope of this paper), it is trivial to see that running TSP on the TDM is equivalent to the TSP model we have used for bitmap compression. However, we can recall that TSP for bitmap compression finds the best solution for that problem; that is, TSP on bitmap index compression minimizes the number of runs (this corresponds to maximizing 1-gaps in the inverted index compression problem). However, the best solution in inverted index compression does not come specifically from the 1-gaps but from the abundance of small gaps. Thus we can conclude that TSP on the TDM (or DSG) is just an approximation of the best solution. Since TSP for bitmap index compression and TSP on the TDM are equivalent, everything applicable to the bitmap compression problem is also applicable to the inverted index compression problem. However, we should also note that an algorithm that works column by column for bitmap compression favors some posting lists to be compressed better than others (this may or may not be desirable depending on the application).

D. Related Work for Inverted Index Compression

The easiest method for inverted index compression via document identifier reassignment is sorting document identifiers according to their source URLs, hoping that similar URLs contain similar documents.
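That the DSM is recoverable from the TDM, but not vice versa, follows from each DSM entry being the dot product of two TDM columns (the number of common terms); the dot products discard which terms were shared. A sketch using the running example:

```python
def dsm_from_tdm(tdm):
    """tdm[t][d] is 1 if term t occurs in document d. The DSM entry
    for documents (i, j) is the dot product of columns i and j,
    i.e. the number of their common terms."""
    n_docs = len(tdm[0])
    return [[sum(row[i] * row[j] for row in tdm) for j in range(n_docs)]
            for i in range(n_docs)]

# rows = terms, columns = documents d1..d3
tdm = [[1, 0, 0],   # cold
       [1, 0, 1],   # weather
       [1, 0, 1],   # rain
       [1, 0, 1],   # forecast
       [0, 1, 0],   # economy
       [0, 0, 1]]   # hot
dsm = dsm_from_tdm(tdm)
# off-diagonal entry dsm[0][2] == 3: d1 and d3 share three terms
```

The diagonal entries simply count the terms in each document, and the matrix is symmetric, so the DSM carries strictly less information than the TDM.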

This simple heuristic generally works fine, but some sites, like Wikipedia, that contain dissimilar documents are just not suitable for this kind of heuristic.

Everything used for bitmap compression can be used on the TDM, and the resulting permutation somewhat reduces the sum of d-gaps. However, we need to keep in mind that since TSP on a bitmap index is optimal but TSP on the TDM is not, any method simulating TSP may not work as well for inverted index compression. Furthermore, any algorithm working column by column in bitmap compression assumes a priority among terms (in other words, some posting lists will be compressed better than others because of an inherent property of the algorithm) [13].

Greedy algorithms [15] for the document identifier reassignment problem expand the path by adding the vertex that is not on, but closest to, the current path. Some variants are available. The main drawback of these algorithms is their worst-case complexity.

Reducing the problem size is important so that high-quality algorithms run in acceptable time on the DSG. SVD on the DSM is one method to achieve this. One of the methods proposed in [7] for inverted index compression uses the k nearest neighbors of each document in the DSG to reduce the problem size. The problem-size-reduction method is taken from [9], which was proposed to handle increasing numbers of dimensions without losing the relationships between entities. Since the DSG is complete, finding a TSP tour is costly. When the closest k documents are found for each document using the k-nearest-neighbor idea, the corresponding DSG becomes very sparse, so the complexity of finding a TSP tour is greatly reduced.

Hypergraph-based methods [4] offer good alternatives for solving the inverted index compression problem. One important property of these algorithms is that they produce small gaps but not necessarily 1-gaps. Thus they may not work for the bitmap compression problem (which requires 1-gaps for maximization of runs); however, they seem better suited for the inverted index compression problem.

III. COLUMN-STORE DBMS ARCHITECTURE

Most commercial DBMSs use record-oriented storage systems. This corresponds to storing the attributes of a tuple contiguously. These kinds of DBMSs are called row-store architectures, and they are optimized for write operations. Row-store architectures perform well for write operations because a single disk write is enough to push all the data of a record to disk. However, there are applications, like data warehouse applications, that require very few write operations and a large number of ad-hoc queries over large amounts of data.

For such applications, write-optimized row-store architectures are not well suited; a column-store architecture, where the values of a single attribute are stored contiguously, is more apt [6]. The efficiency of a column-store architecture for such applications is successfully demonstrated in [8], [1], [2], [3]. The main source of this efficiency is avoiding the retrieval of irrelevant attributes and bringing only the necessary attributes into memory. In a warehouse environment, where typical queries involve aggregates performed over large numbers of data items, avoiding retrieving the whole dataset into memory gives the column-store architecture a sizable advantage over the row-store architecture.

One of the first publications to compare row-store and column-store architectures is [6]. The comparison concludes that there is merit to both architectures; however, the community should reconsider its consensus on row-stores and evaluate the situation on an application basis. In that work, a fully decomposed storage model (DSM) is presented. This storage model stores every attribute separately, together with a surrogate value that corresponds to the surrogate value of its conceptual schema. Furthermore, each such small table is stored twice, once sorted on the attribute value and once sorted on the surrogate value. The best-known column-store based DBMS systems [16], [11] use storage ideas similar to those presented in that paper. Currently there are many proposed methods that try to speed up query processing on column-store systems, using approaches like reusing intermediate results [10], dividing the database into parts where the bigger part is read-optimized [16], exploiting the trade-off between disk bandwidth and CPU cycles [11], and compressing data for faster retrieval [16].

A. Compression of the Dataset for a Column-Store DBMS Architecture

The current technological trend increasingly makes disk bandwidth the bottleneck for many DBMS systems. In other words, current DBMS applications are able to consume data faster than they can retrieve it from disk. This results from the fact that CPU speeds are increasing at a much greater rate than disk bandwidths. One idea to overcome the disk bandwidth problem is to trade CPU cycles, which are abundant, for disk bandwidth, which is not. This effectively corresponds to storing and retrieving compressed data and using more CPU cycles to process the compressed data. The system proposed in [16] uses such encoding schemes to overcome the disk bandwidth problem. Before going into the details of the encoding schemes, we need to define projections. A projection can be defined as a group of columns that are sorted on the same attribute. Their benefit can be threefold.

Since the same data is stored in many projections, this provides some degree of reliability in case of hardware failures. Although storing a column in many different projections seems wasteful, it presents some nice opportunities in query execution and reliability. Furthermore, using compression schemes, we can alleviate the problems presented by redundant storage to some degree.

[16] proposes four different encoding cases, depending on how a column is ordered and the number of distinct values it contains. A column's ordering can be of two types: self-ordered or foreign-ordered. Self-ordered means the order of the column depends on its own values; foreign-ordered means the order of the column depends on some other attribute in the same projection. The number of distinct values is self-explanatory and can be low or high. The four encoding cases can be listed as:

Self-Ordered, Few Distinct Values (case 1): A run-length based encoding is used. The column can be encoded as a list of triplets (v, f, n), where v is the value stored in the column, f is the position in the column at which v first appears, and n is the number of consecutive occurrences of v.

Foreign-Ordered, Few Distinct Values (case 2): A column can be represented as a sequence of pairs (v, b), such that v is the value stored in the column and b is a bitmap indicating the positions at which the value is stored. For example, the column 0, 0, 1, 1, 2, 1, 0, 2, 1 can be encoded using 3 pairs: (0, 110000100), (1, 001101001), and (2, 000010010).

Self-Ordered, Many Distinct Values (case 3): The idea of this scheme is to encode each value in the column as a delta from the previous value in the column. For example, 1, 4, 7, 7, 8, 12 can be encoded as 1, 3, 3, 0, 1, 4.

Foreign-Ordered, Many Distinct Values (case 4): There is a large number of values, so it is more logical to leave the values unencoded.

Looking at these four encoding cases, it is apparent that they are really similar to what we have presented in the previous sections: the first three encoding schemes correspond to simple run-length encoding, bitmap encoding, and the d-gap strategy, respectively. Although these encoding schemes are effective on their own, we can achieve better compression rates by reordering the tuples of projections. We discuss reordering of tuples to achieve better encoding in the next section.
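The first three encodings can be sketched as follows (an illustrative reconstruction of the schemes described above, not code from [16]):

```python
def rle_triplets(col):
    """Case 1: (value, first position, run length) triplets.
    Positions are 1-based, matching the description in the text."""
    out, i = [], 0
    while i < len(col):
        j = i
        while j < len(col) and col[j] == col[i]:
            j += 1
        out.append((col[i], i + 1, j - i))
        i = j
    return out

def value_bitmaps(col):
    """Case 2: one (value, bitmap) pair per distinct value;
    bit k is 1 where the value occurs at position k."""
    return {v: "".join("1" if x == v else "0" for x in col)
            for v in sorted(set(col))}

def deltas(col):
    """Case 3: each value encoded as a delta from its predecessor."""
    return [col[0]] + [b - a for a, b in zip(col, col[1:])]

assert value_bitmaps([0, 0, 1, 1, 2, 1, 0, 2, 1])[0] == "110000100"
assert deltas([1, 4, 7, 7, 8, 12]) == [1, 3, 3, 0, 1, 4]
```

Both assertions reproduce the examples given for case 2 and case 3 in the text.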

B. Reordering of Tuples for Better Compression

We propose reordering tuples to improve the compression rate of the encodings described in the previous section. However, it is important to note that there is a trade-off between query-evaluation-unit complexity and compression scope. For example, projections are sorted on some attribute of the projection; this is probably a very desirable property for query evaluation but a constraining restriction for column compression. Recall that projections provide a way to store attributes that are often used together, and that the same column may exist in multiple projections, each possibly sorted on a different attribute, which allows selecting the best projection to speed up query execution. The same compression rates cannot be achieved if there is a restriction on how we can order tuples. On the other hand, if the query execution unit can live with unsorted data (by using indices, by simply not using the order, or by storing both an ordered projection and an unordered projection), much better compression rates can be achieved. The discussion below is valid for all cases; if keeping order is required, we can apply the reordering within each set of tuples of the projection that share the same ordering-attribute values. Without loss of generality, the explanation from now on assumes no ordering is required.

To achieve better compression using a sequence of (v, f, n) triplets, as in case 1, the TSP formulation for compressing bitmaps can be used. To do so, we construct a graph G where every tuple in the projection corresponds to a vertex in G, and each pair of vertices in G has an edge with cost equal to the number of columns in which the corresponding tuples differ. For example, for a projection of two attributes, the cost of the edge between the vertices corresponding to tuples (6, 7) and (5, 8) is 2, whereas the cost of the edge between the vertices corresponding to tuples (5, 7) and (5, 8) is 1. Then TSP on G gives us the ordering that exactly maximizes the compression rate. The same formulation also exactly minimizes the size of the encoding in the form of (v, b) sequences, as in case 2, when we run-length encode the bitmap b of each (v, b) pair. Here, it is important to note that compressing a list of bitmaps using our standard bitmap compression algorithms is not the same as this formulation. We need to keep in mind that this formulation is used when there are few distinct values, so the resulting bitmap file will have many columns and few rows. Therefore, some method is needed to improve the compression of rows rather than columns in this file, which the above formulation does.

To achieve better compression using the d-gap strategy, as in case 3, we need to construct a similar graph but with different edge weights.
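This dissimilarity-based formulation can be sketched as follows; the edge cost is the sum of absolute per-column differences, a nearest-neighbor heuristic stands in for a real TSP solver, and the tuples are invented for illustration:

```python
def dissimilarity(t1, t2):
    """Sum of absolute per-column differences between two tuples
    (for the case-1 formulation one would instead count the number
    of differing columns)."""
    return sum(abs(a - b) for a, b in zip(t1, t2))

def greedy_order(tuples):
    """Nearest-neighbor heuristic standing in for a full TSP solver:
    repeatedly append the remaining tuple closest to the last one."""
    order = [tuples[0]]
    rest = list(tuples[1:])
    while rest:
        nxt = min(rest, key=lambda t: dissimilarity(order[-1], t))
        order.append(nxt)
        rest.remove(nxt)
    return order

def total_gap(tuples):
    """Total absolute d-gap cost along the current tuple order."""
    return sum(dissimilarity(a, b) for a, b in zip(tuples, tuples[1:]))

rows = [(6, 7), (1, 1), (5, 8), (2, 2)]      # hypothetical projection
reordered = greedy_order(sorted(rows))
assert total_gap(reordered) <= total_gap(rows)
```

On this toy projection the reordering drops the total absolute d-gap cost from 31 to 13, illustrating why smaller gaps then encode well with prefix-free codes.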
For this formulation we construct a graph G where every tuple in the projection corresponds to a vertex in G, and each pair of vertices in G has an edge with cost equal to the sum of the absolute differences of the values in the corresponding tuples over all columns (the cost of an edge can be thought of as the dissimilarity between tuples, to relate this to inverted index compression). For example, for a projection of two attributes, the cost of the edge between the vertices corresponding to tuples (6, 7) and (5, 8) is 2 (= |6-5| + |7-8|), whereas the cost of the edge between the vertices corresponding to tuples (5, 7) and (5, 8) is 1 (= |5-5| + |7-8|). Then TSP on G minimizes the sum of the absolute values of the d-gaps. Although we need to keep a bit to indicate negative d-gap values, we can use prefix-free codes to store the d-gap values with fewer bits. It is important to note that this TSP formulation does not minimize the final encoding size exactly: the final encoding of the columns depends very much on the d-gap compression method, but the formulation does minimize the sum of the absolute values of the d-gaps. These low d-gap values are then encoded effectively with prefix-free codes.

Reordering tuples also changes the applicability conditions of the case-1, case-2, and case-3 encodings. Initially, the case-1 (v, f, n) encoding was applicable only to the self-ordered, few-distinct-values case; otherwise there would be many runs

so the encoding wouldn't be beneficial. However, we can now use this encoding for foreign-ordered columns and when there are many distinct values, because the TSP formulation presented above minimizes the number of runs. Initially, the case-2 (v, b) encoding was applicable only to the few-distinct-values case, because otherwise there would be many (v, b) pairs, which is not desirable. However, the TSP formulation reduces the encoded lengths of the bitmaps in the (v, b) pairs, so storing many (v, b) pairs becomes more manageable. Initially, the case-3 d-gap encoding was applicable only to the self-ordered case, because otherwise there would be very large d-gap values that cannot be encoded efficiently with few bits. This problem is alleviated by the TSP formulation: since the formulation guarantees the formation of small d-gaps, this encoding is now applicable to foreign-ordered columns as well.

With the TSP formulations above, we still expect the case-1 encoding to work best on self-ordered, few-distinct-values columns, the case-2 encoding to work best on foreign-ordered, few-distinct-values columns, and the case-3 encoding to work best on self-ordered, many-distinct-values columns; however, the boundaries of applicability are not as clear as before, so experimental evidence is needed to know which encoding to use when. A simple idea is to try all of them and use the one that compresses best. Furthermore, we now expect some of the encodings presented here to work on foreign-ordered, many-distinct-values columns, because the TSP formulations will order the column so that one of the encodings is likely to produce a beneficial compression.

IV. CONCLUSION

In this work, we discussed the model proposed in [15] for inverted index compression based on reassignment of document identifiers. We showed that this model is also used for bitmap compression. Furthermore, we claimed that the models and related algorithms for either problem are applicable to the other to some extent. The problems are not equivalent (since 1-gaps are necessary for bitmap compression but small gaps are enough for inverted index compression), so modifications to the algorithms may be needed to make them work for the other application. We then presented another application area, column-store based DBMSs, where constructions similar to those in inverted index compression and bitmap index compression may be beneficial. To support this claim, we presented TSP models based on bitmap index compression and inverted index compression for this application area. We therefore claim that heuristics proposed for bitmap compression and inverted index compression based on TSP formulations are applicable to column-store based DBMSs. [11] is an open-source column-store based DBMS; it represents a good baseline on which to try our algorithms. To the best of our knowledge, compression of columns is not utilized in [11], so there is an opportunity to decrease query response times by utilizing the compression techniques we have presented in this report. This report is constructed as a feasibility report for further research.

REFERENCES
[1] [Online]. Available: http://www.sybase.com/products/databaseservers/sybaseiq

[2] [Online]. Available: http://www.addamark.com/products/sls.htm
[3] [Online]. Available: http://www.kx.com/products/database.php
[4] I. C. Baykan, "Inverted index compression based on term and document identifier reassignment," Master's Thesis, Bilkent University, Computer Engineering Department, 2008.
[5] R. Blanco and A. Barreiro, "TSP and cluster-based solutions to the reassignment of document identifiers," Inf. Retr., vol. 9, no. 4, pp. 499-517, 2006. [Online]. Available: http://dx.doi.org/10.1007/s10791-006-6614-y
[6] G. Copeland, W. Alexander, E. Boughter, and T. Keller, "Data placement in Bubba," ACM, 1988, vol. 17, no. 3.
[7] S. Ding, J. Attenberg, and T. Suel, "Scalable techniques for document identifier assignment in inverted indexes," in Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, M. Rappa, P. Jones, J. Freire, and S. Chakrabarti, Eds. ACM, 2010, pp. 311-320. [Online]. Available: http://doi.acm.org/10.1145/1772690.1772723
[8] C. D. French, "'One size fits all' database architectures do not work for DSS," SIGMOD Record (ACM Special Interest Group on Management of Data), vol. 24, no. 2, pp. 449-450, May 1995.
[9] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality," in STOC, 1998, pp. 604-613. [Online]. Available: http://doi.acm.org/10.1145/276698.276876
[10] M. Ivanova, M. Kersten, N. Nes, and R. Goncalves, "An architecture for recycling intermediates in a column-store," ACM Transactions on Database Systems (TODS), vol. 35, no. 4, p. 24, 2010.
[11] S. Manegold, M. L. Kersten, and P. A. Boncz, "Database Architecture Evolution: Mammals Flourished Long Before Dinosaurs Became Extinct," in Proceedings of the International Conference on Very Large Data Bases (VLDB), August 2009. 10-year Best Paper Award for "Database Architecture Optimized for the New Bottleneck: Memory Access," in Proceedings of the International Conference on Very Large Data Bases (VLDB), pp. 54-65, Edinburgh, United Kingdom, September 1999. [Online]. Available: http://oai.cwi.nl/oai/asset/14299/14299B.pdf
[12] A. Moffat and J. Zobel, "Self-indexing inverted files for fast text retrieval," ACM Transactions on Information Systems, vol. 14, no. 4, pp. 349-379, 1996.
[13] A. Pinar, T. Tao, and H. Ferhatosmanoglu, "Compressing bitmap indices by data reorganization," in ICDE. IEEE Computer Society, 2005, pp. 310-321. [Online]. Available: http://csdl.computer.org/comp/proceedings/icde/2005/2285/00/22850310abs.htm
[14] E. Riloff and L. Hollaar, "Text databases and information retrieval," ACM Computing Surveys, vol. 28, no. 1, pp. 133-135, Mar. 1996. [Online]. Available: http://www.acm.org/pubs/citations/journals/surveys/1996-28-1/p133-riloff/
[15] W.-Y. Shieh, T.-F. Chen, J. J.-J. Shann, and C.-P. Chung, "Inverted file compression through document identifier reassignment," Inf. Process. Manage., vol. 39, no. 1, pp. 117-131, 2003. [Online]. Available: http://dx.doi.org/10.1016/S0306-4573(02)00020-1
[16] M. Stonebraker, D. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil et al., "C-store: a column-oriented DBMS," in Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment, 2005, pp. 553-564.
[17] I. H. Witten, A. Moffat, and T. C. Bell, Managing Gigabytes (2nd ed.): Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers Inc., 1999.
[18] J. Zobel and A. Moffat, "Adding compression to a full-text retrieval system," Software: Practice and Experience, vol. 25, 1995.
[19] J. Zobel, A. Moffat, and K. Ramamohanarao, "Inverted files versus signature files for text indexing," ACM Transactions on Database Systems, vol. 23, 1998.
