Академический Документы
Профессиональный Документы
Культура Документы
Partitioning
ABSTRACT One area in which the Semantic Web community differs from
Efficient management of RDF data is an important factor in real- the relational database community is in its choice of data model.
izing the Semantic Web vision. Performance and scalability issues The Semantic Web data model, called the Resource Description
are becoming increasingly pressing as Semantic Web technology Framework, [9] or RDF, represents data as statements about re-
is applied to real-world applications. In this paper, we examine the sources using a graph connecting resource nodes and their property
reasons why current data management solutions for RDF data scale values with labeled arcs representing properties. Syntactically, this
poorly, and explore the fundamental scalability limitations of these graph can be represented using XML syntax (RDF/XML). This is
approaches. We review the state of the art for improving perfor- typically the format for RDF data exchange; however, structurally,
mance for RDF databases and consider a recent suggestion, prop- the graph can be parsed into a series of triples, each representing a
erty tables. We then discuss practically and empirically why this statement of the form < sub ject, property, ob ject >, which is the
solution has undesirable features. As an improvement, we propose notation we follow in this paper. These triples can then be stored
an alternative solution: vertically partitioning the RDF data. We in a relational database with a three-column schema. For example,
compare the performance of vertical partitioning with prior art on to represent the fact that Serge Abiteboul, Rick Hull, and Victor
queries generated by a Web-based RDF browser over a large-scale Vianu wrote a book called Foundations of Databases we would
(more than 50 million triples) catalog of library data. Our results use seven triples1 :
show that a vertical partitioned schema achieves similar perfor- person1 isNamed Serge Abiteboul
mance to the property table technique while being much simpler person2 isNamed Rick Hull
to design. Further, if a column-oriented DBMS (a database archi- person3 isNamed Victor Vianu
tected specially for the vertically partitioned case) is used instead book1 hasAuthor person1
of a row-oriented DBMS, another order of magnitude performance book1 hasAuthor person2
improvement is observed, with query times dropping from minutes book1 hasAuthor person3
to several seconds. book1 isTitled Foundations of Databases
union the results of joining each propertys table with t and count
the Postgres tuple header is 27 bytes compared with just 8 bytes of Triple Store
C-Store
Vertical Partitioning
actual data per tuple and so the Postgres table scan needs to read 35
Figure 4: Query 6 performance as number of triples scale.
bytes per tuple (actually, more than this if one includes the pointer
to the tuple in the page header) compared with just 8 for C-Store.
Another reason why C-Store performs better is that it uses an
index nested loops join to join keys with the strings dictionary ta- the property table approach would be improved if Postgres imple-
ble while Postgres chooses to do a merge join. This final join takes mented GROUP BY GROUPING SETs.
5 seconds longer in Postgres than it does in C-Store (this 5 sec- Second, for the vertically partitioned schema, Postgres processes
ond overhead is observable in the other queries as well). These subject-subject joins non-optimally. For queries that feature the cre-
two reasons account for the majority of the performance differ- ation of a temporary table containing subjects that are to be joined
ence between the systems; however the other advantages of using a with the subjects of the other properties tables, we know that the
column-store described in Section 3.2 are also a factor. temporary list of subjects will be in sorted order, as it comes from a
Q2 shows why avoiding the expensive subject-subject joins of table that is clustered on subject. Postgres does not carry this infor-
the triple-store is crucial, since the triple-store performs much more mation into the temporary table, and will only perform a merge join
slowly than the other systems. The vertical partitioning approach is for intermediate tuples that are guaranteed to be sorted. To simulate
outperformed by the property table approach since it performs 28 the fact that other databases would maintain the metadata about the
merge joins that the property table approach does not need to do sorted temporary subject list, we create a clustered index on the
(again, the property table approach is helped by the fact that we use temporary table before the UNION-JOIN operation. We only in-
the optimal property table for each query). cluded the time to create the temporary table and the UNION-JOIN
As expected, the multiple sequential scans of the property table operations in the total query time, as the clustering is a Postgres
hurt it in Q3. Q4 is so highly selective that the query results for all implementation artifact.
but C-Store are quite similar. The results of the optimal property Further, Postgres does not assume that a table clustered on an
table in Q5-Q7 are on par with those of the vertically partitioned attribute is in perfectly sorted order (due to possible modifications
option, and show that subject-object joins hurt each of the stores after the cluster operation), and thus can not perform the merge join
significantly. directly; rather it does so in conjunction with an index scan, as the
On the whole, vertically partitioning a database provides a signif- index is in sorted order. This process incurs extra seeks as the leaves
icant performance improvement over the triple-store schema, and of the B+ tree are traversed, leading to a significant cost effect com-
performs similarly to property tables. Given that vertical partition- pared to the inexpensive merge join operations of C-Store.
ing in a row-oriented database is competitive with the optimal sce- With a different choice of RDBMS, performance results might
nario for a property table solution, we conclude that they are the differ, but we remain convinced that Postgres was a good choice
preferable solution since they are simpler to implement. Further, if of RDBMS, given that it handles NULL values so well, and thus
one uses a database designed for vertically partitioned data such as enabled us to fairly benchmark the property table solutions.
C-Store, additional performance improvement can be realized. C-
Store achieved nearly-interactive time on our benchmark running
on a single machine that is two years old.
6.5 Scalability
We also note that multi-valued attributes play a role in reducing Although the magnitude of query performance is important, an
the performance of the property table approach. Because we im- arguably more important factor to consider is how performance
plement multi-valued attributes in property tables as arrays, simple scales with size of data. In order to determine this, we varied the
indexing can not be performed on these arrays, and the GiST [28] number of triples we used from the library dataset from one mil-
indexing of integer arrays performs worse than a sequential scan of lion to fifty million (we randomly chose what triples to use from
the property table. a uniform distribution) and reran the benchmark queries. Figure 4
Finally, we remind the reader that the property tables for each shows the results of this experiment for query 6. Both vertical par-
query are idealized in that they only include the subset of columns titioning schemes (Postgres and C-Store) scale linearly, while the
that are required for the query. As we will show in Section 6.7, triple-store scales super-linearly. This is because all joins for this
poor choice in columns for a property table will lead to less-than- query are linear for the vertically partitioned schemes (either merge
optimal results, whereas the vertical partitioning solution represents joins for the subject-subject joins, or index scan merge joins for
the best- and worst-case scenarios for all queries. the subject-object inference step); however the triple-store sorts the
intermediate results after performing the three selections and be-
fore performing the merge join. We observed similar results for all
6.4.1 Postgres as a Choice of RDBMS queries except queries 1, 4, and 7 (where the triple-store also scales
There are several notes to consider that apply to our choice of linearly, but with a much higher slope relative to the vertically par-
Postgres as the RDBMS. First, for Q3 and Q4, performance for titioned schemes).
Q5 Q6 Query Wide Property Table Property Table
Property Table 39.49 (17.5% faster) 62.6 (38% faster) % slowdown
Vertical Partitioning 4.42 (92% faster) 65.84 (22% faster) Q1 60.91 381%
C-Store 2.57 (84% faster) 2.70 (75% faster) Q2 33.93 85%
Q3 584.84 1%
Table 2: Query times (in seconds) for Q5 and Q6 after the Q4 44.96 58%
Records:Type path is materialized. % faster = 100|originalnew|
original
. Q5 76.34 60%
Q6 154.33 53%
Q7 24.25 298%
6.6 Materialized Path Expressions
Table 3: Query times in seconds comparing a wider than nec-
As described in Section 4, materialized path expressions can re-
essary property table to the property table containing only the
move the need to perform expensive subject-object joins by adding
columns required for the query. % Slowdown = 100|originalnew| .
additional columns to the property table or adding an extra table original
to the vertically partitioned and column-oriented solutions. This Vertically partitioned stores are not affected.
makes it possible to replace subject-object joins with cheaper subject-
subject joins. Since Queries 5 and 6 contain subject-object joins, we the notion that while property tables can sometimes outperform ver-
reran just those experiments using materialized path expressions. tical partitioning on a row-oriented store, a poor choice of property
Recall that in these queries we join object values from the Records table can result in significantly poorer query performance. The ver-
property with subject values to get those subjects that can be in- tically partitioned solutions are impervious to such effects.
ferred to be a particular type through the Records property.
For the property table approach, we widened the property table
by adding a new column representing the materialized path expres- 7. CONCLUSION
sion: Records:Type. This column indicates the type of entity that The emergence of the Semantic Web necessitates high-performance
is related to a subject through the Records property (if a subject data management tools to manage the tremendous collections of
does not have a Records property defined, its value in this column RDF data being produced. Current state of the art RDF databases
will be NULL). Similarly, for the vertically partitioned and column- triple-stores scale extremely poorly since most queries require
oriented solutions, we added a table containing a subject column multiple self-joins on the triples table. The previously proposed
and a Records:Type object column, thus allowing one to find the property table optimization has not been adopted in most RDF
Type of objects that a resource Records with a cheap subject-subject databases, perhaps due to its complexity and inability to handle
merge join. The results are displayed in Table 2. multi-valued attributes. We showed that a poorly-selected property
It is clear that materializing the path expression and removing table can result in a factor of 3.8 slowdown over an optimal prop-
the subject-object join results in significant improvement for all erty table, thus making the solution difficult to use in practice. As
schemas. However, the vertically partitioned schemas see a greater an alternative to property tables, we proposed vertically partition-
benefit since the materialized path expression is multi-valued (which ing tables and demonstrated that they achieve similar performance
is the common case, since if at least one property along the path is as property tables in a row-oriented database, while being simpler
multi-valued, then the materialized result will be multi-valued). to implement. Further, we showed that on a version of the C-Store
In summary, Q5 and Q6, which used to take 400 and 200 seconds column-oriented database, it is possible to achieve a factor of 32
respectively on the triple-store, now take less than three seconds on performance improvement over the current state of the art triple
the column-store. This represents a two orders of magnitude perfor- store design. Queries that used to take hundreds of seconds can now
mance improvement! be run in less than ten seconds, a significant step toward interactive-
time semantic web content storage and querying.
6.7 The Effect of Further Widening
Given that semantic web content is likely to have an unstruc- 8. ACKNOWLEDGMENTS
tured schema, clustering algorithms will not always yield property We thank George Huo and the Postgres development team for
tables that are the perfect width for all queries. We now experi- their advice on our Postgres implementation, and Michael Stone-
mentally demonstrate the effect of property tables that are wider braker for his feedback on this paper. This work was supported by
than they need to be for the same queries run in the experiments the National Science Foundation under grants IIS-048124, CNS-
above. Row-stores traditionally perform poorly relative to verti- 0520032, IIS-0325703 and two NSF Graduate Research Fellow-
cally partitioned schemas and column-stores when queries need to ships.
access only a few columns of a wide table, so we expect the per-
formance of the property table implementation to degrade with in- 9. REFERENCES
creasing table width. To measure this, we synthetically added 60 [1] Library catalog data.
non-sparse random integer-valued columns to the end of each tuple http://simile.mit.edu/rdf-test-data/barton/.
in the widest property table in Postgres. This resulted in an approx-
[2] Longwell website. http://simile.mit.edu/longwell/.
imately 7 GByte increase in database size. We then re-ran Q1-Q7
on this wide property table. The results are shown in Table 3. [3] Redland RDF Application Framework.
Since each of the queries (except query 1) executes in two parts, http://librdf.org/.
first creating a temporary table containing the subset of the rele- [4] Simile website. http://simile.mit.edu/.
vant data for that query, and then executing the rest of the query on [5] Swoogle. http://swoogle.umbc.edu/.
this subset, we see some variance in % slowdown numbers, where [6] Uniprot rdf dataset.
smaller slowdown numbers indicate that a majority of the query http://dev.isb-sib.ch/projects/uniprot-rdf/.
time was spent on the second stage of query processing. However, [7] Wordnet rdf dataset.
each query sees some degree of slowdown. These results support http://www.cogsci.princeton.edu/wn/.
[8] World Wide Web Consortium (W3C). [30] J. Shanmugasundaram, K. Tufte, C. Zhang, G. He, D. J.
http://www.w3.org/. DeWitt, and J. F. Naughton. Relational databases for
[9] RDF Primer. W3C Recommendation. querying XML documents: Limitations and opportunities. In
http://www.w3.org/TR/rdf-primer, 2004. Proc. of VLDB, pages 302314, 1999.
[10] RDQL - A Query Language for RDF. W3C Member [31] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen,
Submission 9 January 2004. M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J.
http://www.w3.org/Submission/RDQL/, 2004. ONeil, P. E. ONeil, A. Rasin, N. Tran, and S. B. Zdonik.
[11] SPARQL Query Language for RDF. W3C Working Draft 4 C-Store: A column-oriented DBMS. In VLDB, pages
October 2006. 553564, 2005.
http://www.w3.org/TR/rdf-sparql-query/, 2006. [32] Y. Theoharis, V. Christophides, and G. Karvounarakis.
[12] D. Abadi, A. Marcus, S. Madden, and K. Hollenbach. Using Benchmarking database representations of RDF/S stores. In
the Barton libraries dataset as an RDF benchmark. Technical Proc. of ISWC, 2005.
Report MIT-CSAIL-TR-2007-036, MIT. [33] K. Wilkinson. Jena property table implementation. In SSWS,
[13] D. J. Abadi. Column stores for wide and sparse data. In 2006.
CIDR, 2007. [34] K. Wilkinson, C. Sayers, H. Kuno, and D. Reynolds. Efficient
[14] D. J. Abadi, S. Madden, and M. Ferreira. Integrating RDF Storage and Retrieval in Jena2. In SWDB, pages
Compression and Execution in Column-Oriented Database 131150, 2003.
Systems. In SIGMOD, 2006.
[15] D. J. Abadi, D. S. Myers, D. J. DeWitt, and S. R. Madden. APPENDIX
Materialization strategies in a column-oriented DBMS. In
Proc. of ICDE, 2007. Below are the seven benchmark queries as implemented on a triple-
store. Note that for clarity of presentation, the predicates are shown
[16] R. Agrawal, A. Somani, and Y. Xu. Storage and Querying of
on raw strings (instead of the dictionary encoded values) and the
E-Commerce Data. In VLDB, 2001.
post-processing dictionary decoding step is not shown. Further, URIs
[17] S. Alexaki, V. Christophides, G. Karvounarakis,
have been appreviated (the full URIs and the list of 28 properties in
D. Plexousakis, and K. Tolle. The ICS-FORTH RDFSuite:
the properties table are presented in our technical report [12]).
Managing voluminous RDF description bases. In SemWeb,
2001.
[18] J. Beckmann, A. Halverson, R. Krishnamurthy, and Query1: Query5:
J. Naughton. Extending RDBMSs To Support Sparse
SELECT A.obj, count(*) SELECT B.subj, C.obj
Datasets Using An Interpreted Attribute Storage Format. In FROM triples AS A FROM triples AS A, triples AS B,
ICDE, 2006. WHERE A.prop = "<type>" triples AS C
GROUP BY A.obj WHERE A.subj = B.subj
[19] P. A. Boncz and M. L. Kersten. MIL primitives for querying AND A.prop = "<origin>"
a fragmented world. VLDB Journal, 8(2):101119, 1999. Query2: AND A.obj = "<info:marcorg/DLC>"
[20] P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: AND B.prop = "<records>"
SELECT B.prop, count(*) AND B.obj = C.subj
Hyper-pipelining query execution. In CIDR, pages 225237, FROM triples AS A, triples AS B, AND C.prop = "<type>"
2005. properties AS P AND C.obj != "<Text>"
WHERE A.subj = B.subj
[21] V. Bonstrom, A. Hinze, and H. Schweppe. Storing RDF as a AND A.prop = "<type>" Query6:
graph. In Proc. of LA-WEB, 2003. AND A.obj = "<Text>"
[22] J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A AND P.prop = B.prop SELECT A.prop, count(*)
GROUP BY B.prop FROM triples AS A,
Generic Architecture for Storing and Querying RDF and properties AS P,
RDF Schema. In ISWC, pages 5468, 2002. Query3: (
[23] E. I. Chong, S. Das, G. Eadon, and J. Srinivasan. An Efficient (SELECT B.subj
SELECT B.prop, B.obj, count(*) FROM triples AS B
SQL-based RDF Querying Scheme. In VLDB, pages FROM triples AS A, triples AS B, WHERE B.prop = "<type>"
12161227, 2005. properties AS P AND B.obj = "<Text>")
WHERE A.subj = B.subj UNION
[24] G. P. Copeland and S. N. Khoshafian. A decomposition AND A.prop = "<type>" (SELECT C.subj
storage model. In Proc. of SIGMOD, pages 268279, 1985. AND A.obj = "<Text>" FROM triples AS C,
[25] J. Corwin, A. Silberschatz, P. L. Miller, and L. Marenco. AND P.prop = B.prop triples AS D
GROUP BY B.prop, B.obj WHERE C.prop = "<records>"
Dynamic tables: An architecture for managing evolving, HAVING count(*) > 1 AND C.obj = D.subject
heterogeneous biomedical data in relational database AND D.prop = "<type>"
management systems. Journal of the American Medical Query4: AND D.obj = "<Text>")
Informatics Association, 14(1):8693, 2007. ) AS uniontable
SELECT B.prop, B.obj, count(*) WHERE A.subj = uniontable.subj
[26] D. Florescu and D. Kossmann. Storing and querying XML FROM triples AS A, AND P.prop = A.prop
triples AS B,
data using an RDMBS. IEEE Data Eng. Bull., 22(3):2734, triples AS C,
GROUP BY A.prop
1999. properties AS P Query7:
[27] S. Harris and N. Gibbins. 3store: Efficient bulk RDF storage. WHERE A.subj = B.subj
AND A.prop = "<type>" SELECT A.subj, B.obj, C.obj
In In Proc. of PSSS03, pages 115, 2003. AND A.obj = "<Text>" FROM triples AS A, triples AS B,
[28] J. M. Hellerstein, J. F. Naughton, and A. Pfeffer. Generalized AND P.prop = B.prop triples AS C
search trees for database systems. In Proc. of VLDB 1995, AND C.subj = B.subj WHERE A.prop = "<Point>"
AND C.prop = "<language>" AND A.obj = "end"
Zurich, Switzerland, pages 562573. AND C.obj = AND A.subj = B.subject
[29] R. MacNicol and B. French. Sybase IQ Multiplex - Designed "<language/iso639-2b/fre>" AND B.prop = "<Encoding>"
For Analytics. In VLDB, pages 12271230, 2004. GROUP BY B.prop, B.obj AND A.subj = C.subject
HAVING count(*) > 1 AND C.prop = "<type>"