
TDB

TDB is a component of Jena for RDF storage and query. It supports the full range of Jena APIs. A TDB store can be accessed and managed with the provided command line scripts and via the Jena API.

Contents

1 Status
2 Downloads
3 Documentation
4 Subversion
5 Support

Status
TDB can be used as a high performance, non-transactional, RDF store on a single machine. Documentation on the wiki describes the latest version unless otherwise noted.

Downloads
TDB is distributed from the Jena project on SourceForge (see the Jena download area). It is also available from the Jena Maven repositories, for use with Apache Maven and Apache Ivy (see the Jena repository and the Jena development repository).

Documentation

TDB Installation
TDB Requirements
Command line utilities
Use from Java
Use on 64 bit or 32 bit Java systems
Datasets and Named Graphs
Assemblers for Graphs and Datasets
Dynamic Datasets: Query a subset of the named graphs
Quad filtering: Hide information in the dataset
Value Canonicalization
TDB Design
The TDB Optimizer
TDB Configuration
Joseki Integration

Subversion
TDB Subversion repository on SourceForge.

Support
Email to jena-dev@groups.yahoo.com.

TDB/Installation

TDB requires Java 6 for versions up to 0.8.1, and Java 5 for versions afterwards. TDB is distributed as a complete download and also via Apache Maven (groupId com.hp.hpl.jena, artifactId tdb). The Jena repositories are:

http://openjena.org/repo (this is mirrored to the central Maven repository)
http://openjena.org/repo-dev (development builds)

After downloading a TDB distribution, place all the jars in the lib/ directory on the classpath. TDB itself is a single jar; the other jars in lib/ are a consistent snapshot of its dependencies at the time of release. The TDB build system uses Maven, but there is also an Ant build, invoked with "ant jar".

TDB/Requirements
TDB up to version 0.8.1 requires Java 6. Versions 0.8.2 onwards require Java 5.

TDB can run on 32 bit or 64 bit JVMs; it adapts to the underlying architecture by choosing different file access mechanisms. 64 bit Java is preferred for large scale and production deployments. On 64 bit Java, TDB uses memory mapped files; on 32 bit platforms, TDB uses in-heap caching of file data. In practice, the JVM heap size should be set to at least 1Gbyte.

While there are no inherent scaling limits on the size of the database, in practice only one large dataset can be handled per TDB instance.

The on-disk file format is compatible between 32 and 64 bit systems, and databases can be transferred between systems by file copy provided the databases are not in use (no TDB instance is accessing them at the time). Databases cannot be copied while TDB is running, even if TDB is not actively processing a query or update.

TDB/Commands
Contents

1 Scripts
  1.1 Script set up
  1.2 Argument Structure
  1.3 Setting options from the command line
2 TDB Commands
  2.1 Store description
  2.2 tdbloader
  2.3 tdbquery
  2.4 tdbdump
  2.5 tdbstats

Scripts

The directory bin/ contains shell scripts to run the commands from the command line. They are bash scripts, and work on Linux and on Cygwin for MS Windows.

Script set up

Set the environment variable TDBROOT to the root of the TDB installation, then put the scripts on the command path:
$ PATH=$TDBROOT/bin:$PATH

Alternatively, there are wrapper scripts in $TDBROOT/bin2 which can be placed in a convenient directory that is already on the shell command path.

Argument Structure

Each command then has command-specific arguments, described below. All commands support --help to give details of named and positional arguments. There are two equivalent forms of named argument syntax:
--arg=val --arg val

Setting options from the command line

TDB has a number of configuration options which can be set from the command line using:
--set tdb:symbol=value

Using tdb: is really a shorthand for the URI prefix http://jena.hpl.hp.com/TDB#, so the full URI form is
--set http://jena.hpl.hp.com/TDB#symbol=value
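The same symbols can also be set from Java through the global context (a minimal sketch; TDB.symUnionDefaultGraph, used later in this document, is one such symbol):

// Programmatic equivalent of: --set tdb:unionDefaultGraph=true
TDB.getContext().set(TDB.symUnionDefaultGraph, true) ;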

TDB Commands
Store description

TDB commands use an assembler description for the persistent store:
--desc=assembler.ttl --tdb=assembler.ttl

or a direct reference to the directory with the index and node files:
--loc=DIRECTORY --location=DIRECTORY

The assembler description follows the form for a dataset given on the TDB assembler description page. If neither an assembler file nor a location is given, --desc=tdb.ttl is assumed.
tdbloader

Bulk loader and index builder. Performs bulk load operations more efficiently than simply reading RDF into a TDB-backed model.
tdbquery

Invoke a SPARQL query on a store. Use --time for timing information. The store is attached on each run of this command, so the timing includes some overhead not present in a running system. Details about query execution can be obtained -- see the notes on the TDB Optimizer.
tdbdump

(Version 0.8.5)

Dump the store in N-Quads format.

tdbstats

Produces a statistics file for the dataset. See the TDB Optimizer description.

TDB/JavaAPI
TDB supports all the operations of the Jena API, including SPARQL query. The application obtains a model or RDF dataset from TDB, then uses it as for any other model or dataset. See also Concurrency and Locking.

Contents

1 Constructing a model or dataset
  1.1 Using a directory name
  1.2 Using an assembler file
2 Bulkloader
3 Caching and synchronization
4 Concurrency

Constructing a model or dataset


The class TDBFactory contains the static factory methods for creating and connecting to a TDB-backed graph or RDF dataset. Models and datasets should be closed after use. See also the examples in src-examples.

An application can specify the model or dataset by:

1. Giving a directory name
2. Giving an assembler file

If a directory is empty, the TDB files for indexes and node table are created. If the directory contains files from a previous application run, TDB connects to the data already there.

Closing the model or dataset is important: any updates made are forced to disk if they have not been written already.

Using a directory name
// Direct way: Make a TDB-backed Jena model in the named directory.
String directory = "MyDatabases/DB1" ;
Model model = TDBFactory.createModel(directory) ;
...
model.close() ;

// Direct way: Make a TDB-backed dataset.
String directory = "MyDatabases/Dataset1" ;
Dataset dataset = TDBFactory.createDataset(directory) ;
...
dataset.close() ;

Using an assembler file


// Assembler way: Make a TDB-backed Jena model in the named directory.
// This way, you can change the model being used without changing the code.
// The assembler file is a configuration file.
// The same assembler description will work in Joseki.
String assemblerFile = "Store/tdb-assembler.ttl" ;
Model model = TDBFactory.assembleModel(assemblerFile) ;
...
model.close() ;

String assemblerFile = "Store/tdb-assembler.ttl" ;
Dataset dataset = TDBFactory.assembleDataset(assemblerFile) ;
...

dataset.close() ;

See the TDB assembler documentation for details.

Bulkloader
The bulkloader is a faster way to load data into an empty graph than just using the Jena update operations. It is accessed through the command line utility tdbloader.

Caching and synchronization


TDB employs caching at various levels, from RDF terms to disk blocks. It is important to flush all caches to make the file state consistent with the cached states, because some caches are write-behind, so unwritten changes may be held in memory. Caches are flushed when a model or dataset is closed.
Model model = TDBFactory.createModel(location) ;
...
model.close() ;

Dataset dataset = TDBFactory.createDataset(location) ;
...
dataset.close() ;

In addition, while TDB does not support full transaction semantics, Model.commit does provide for synchronising the in-memory and disk states:
model.commit() ;

TDB provides an explicit call for model and dataset objects for synchronization with disk:
Model model = ... ;
TDB.sync(model) ;

Dataset dataset = ... ;
TDB.sync(dataset) ;

Any dataset or model can be passed to these functions - if they are not backed by TDB then no action is taken and the call merely returns without error.

Concurrency
TDB provides a Multiple Reader or Single Writer (MRSW) policy for concurrent access. Applications are expected to adhere to this policy - it is not automatically checked. One gotcha is Java iterators: an iterator that is moving over the database is making read operations, so no updates to the dataset are possible while such an iterator is in use.
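As an illustration, an application can use Jena's critical section support to stay within the MRSW policy (a minimal sketch; Lock is com.hp.hpl.jena.shared.Lock, and process() is a placeholder for application code):

model.enterCriticalSection(Lock.READ) ;
try {
    StmtIterator it = model.listStatements() ;
    while ( it.hasNext() )
        process(it.nextStatement()) ;   // read-only work; no updates while iterating
    it.close() ;                        // finish the iterator before any update
} finally {
    model.leaveCriticalSection() ;
}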

TDB/JVM-64-32
TDB runs on both 32-bit and 64-bit Java Virtual Machines. The same file formats are used on both systems, and database files can be transferred between architectures (no TDB system should be running for the database at the time of copy). What differs is the file access mechanism used.

The file access mechanism can be set explicitly, but this is not a good idea for production usage, only for experimentation - see the File Access mode option.

64-bit Java
On 64-bit Java, TDB uses memory mapped files and the operating system handles much of the caching between RAM and disk. The amount of RAM used for file caching increases and decreases as other applications run on the machine; the fewer other programs running, the more RAM will be available for file caching. TDB is faster on a 64 bit JVM because more memory is available for file caching.

32-bit Java
On 32-bit Java, TDB uses its own file caching to enable large databases. 32-bit Java limits the address space of the JVM to about 1.5Gbytes (the exact size is JVM-dependent), and this includes memory mapped files, even though they are not in the Java heap. The JVM heap size may need to be increased to make space for the disk caches used by TDB.

TDB/Datasets
An RDF Dataset is a collection of one unnamed default graph and zero or more named graphs. In a SPARQL query, a query pattern is matched against the default graph unless the GRAPH keyword is applied to a pattern.

Contents

1 Dataset Storage
2 Dataset Query

Dataset Storage
One file location (directory) is used to store one RDF dataset. The unnamed graph of the dataset is held as a single graph, while all the named graphs are held in a collection of quad indexes. Every dataset obtained via TDBFactory.createDataset(Location) within a JVM for the same location is the same dataset. If a model is obtained via TDBFactory.createModel(Location), there is a hidden, shared dataset and the appropriate model is returned.
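For example (a minimal sketch of the behaviour just described; the directory name is illustrative):

// Within one JVM, both handles refer to the same underlying dataset.
Dataset ds1 = TDBFactory.createDataset("MyDatabases/DB1") ;
Dataset ds2 = TDBFactory.createDataset("MyDatabases/DB1") ;
// Updates made through ds1 are visible through ds2.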

Dataset Query
(TDB version 0.7.0 and later) There is full support for SPARQL query over named graphs in a TDB-backed dataset. All the named graphs can be treated as a single graph which is the union (RDF merge) of all the named graphs. This union is given the special graph name <urn:x-arq:UnionGraph> in a GRAPH pattern. When querying the RDF merge of named graphs, the default graph in the store is not included. This feature applies to queries only; it does not affect the storage, nor does it change loading.

Alternatively, if the symbol tdb:unionDefaultGraph (see TDB Configuration) is set, the unnamed graph for the query is the union of all the named graphs in the dataset. The stored default graph is ignored and is not part of the data of the union graph, although it is accessible by the special name <urn:x-arq:DefaultGraph> in a GRAPH pattern.

Special Graph Names

urn:x-arq:UnionGraph : The RDF merge of all the named graphs in the dataset of the query.
urn:x-arq:DefaultGraph : The default graph of the dataset, used when the default graph of the query is the union graph.

Note that setting tdb:unionDefaultGraph does not affect the default graph or default model obtained with dataset.getDefaultModel(). The RDF merge of all named graphs can be accessed as the named graph urn:x-arq:UnionGraph using Dataset.getNamedModel("urn:x-arq:UnionGraph").
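From Java, the union graph can then be queried like any other model (a minimal sketch; the directory name and query are illustrative):

Dataset dataset = TDBFactory.createDataset("MyDatabases/DB1") ;
// The RDF merge of all named graphs, accessed by its special name.
Model union = dataset.getNamedModel("urn:x-arq:UnionGraph") ;
QueryExecution qExec = QueryExecutionFactory.create("SELECT * { ?s ?p ?o }", union) ;
ResultSet rs = qExec.execSelect() ;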

TDB/Assembler
Assemblers are a general mechanism in Jena to describe objects to be built; often these objects are models and datasets. Assemblers are used heavily in Joseki for dataset and model descriptions, for example. SPARQL queries operate over an RDF dataset, which is an unnamed default graph and zero or more named graphs. Having the description in a file means that the data the application is going to work on can be changed without changing the program code.

Dataset
This is needed for use in Joseki. A dataset can be constructed in an assembler file:
@prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:  <http://jena.hpl.hp.com/2005/11/Assembler#> .

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .

<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "DB" ;
    .

Only one dataset can be stored in a location (filing system directory). The first section declares the prefixes used later:
@prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:  <http://jena.hpl.hp.com/2005/11/Assembler#> .

then there is a statement that causes TDB to be loaded. TDB initialization occurs automatically when loaded. The TDB jar must be on the Java classpath.

Order in this file does not matter to the machine, because the Jena assembler system checks for any ja:loadClass statements before any attempt to assemble an object is made; but having it early in the file is helpful to any person looking at the file.
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .

And finally there is the description of the TDB dataset itself:


<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "DB" ;
    .

The property tdb:location gives the file name as a string. It is relative to the application's current working directory, not to where the assembler file is read from. The dataset description is usually found by looking for the one subject with type tdb:DatasetTDB. If more than one description is given in a single file, the application will have to specify which description it wishes to use.
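An application then assembles the dataset from this file using the method shown in the Java API section (a minimal sketch; the file name is illustrative):

// Assemble the dataset described in the assembler file above.
Dataset dataset = TDBFactory.assembleDataset("tdb-dataset.ttl") ;
...
dataset.close() ;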

Graph
A single graph can be described as well:
@prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:  <http://jena.hpl.hp.com/2005/11/Assembler#> .

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .

<#graph> rdf:type tdb:GraphTDB ;
    tdb:location "DB" .

but note that this graph is a single graph at that location; it is the default graph of a dataset. A particular named graph in the dataset at a location can be assembled with:
@prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:  <http://jena.hpl.hp.com/2005/11/Assembler#> .

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .

<#graph> rdf:type tdb:GraphTDB ;
    tdb:location "DB" ;
    tdb:graphName <http://example/graph1> ;
    .

It is also possible to describe a graph, or named graph, in a dataset where the dataset description can now be shared:
<#graph2> rdf:type tdb:GraphTDB ;
    tdb:dataset <#dataset> ;
    .

<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "DB" ;
    .

Mixed Datasets
It is possible to create a dataset with graphs backed by different storage subsystems, although query is not necessarily as efficient. To include a TDB-backed graph as a named graph in a dataset, use the vocabulary as shown below:
@prefix tdb:  <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ja:   <http://jena.hpl.hp.com/2005/11/Assembler#> .

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .

tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
tdb:GraphTDB   rdfs:subClassOf ja:Model .

# A dataset of one TDB-backed graph as the default graph
# and an in-memory graph as a named graph.
<#dataset> rdf:type ja:RDFDataset ;
    ja:defaultGraph <#graph> ;
    ja:namedGraph
        [ ja:graphName <http://example.org/name1> ;
          ja:graph <#graph2> ] ;
    .

<#graph> rdf:type tdb:GraphTDB ;
    tdb:location "DB" ;
    .

<#graph2> rdf:type ja:MemoryModel ;
    ja:content [ ja:externalContent <file:Data/books.n3> ] ;
    .

Note here we added:


tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
tdb:GraphTDB   rdfs:subClassOf ja:Model .

which provides for integration with complex model setups, such as reasoners.

TDB/DynamicDatasets
(TDB version 0.8.5 and later) This feature allows a query to be made on a subset of all the named graphs in the TDB data store. The SPARQL GRAPH pattern allows access either to a specific named graph or to all the named graphs in a dataset. This feature means that only specified named graphs are visible to the query.

SPARQL has the concept of a dataset description. In a query string, the FROM and FROM NAMED clauses specify the dataset: the FROM clauses define the graphs that are merged to form the default graph, and the FROM NAMED clauses identify the graphs to be included as named graphs. Normally, ARQ interprets these as coming from the web, that is, the graphs are read using HTTP GET. TDB modifies this behavior: instead of the universe of graphs being the web, the universe of graphs is the TDB data store. FROM and FROM NAMED describe a dataset with graphs drawn only from the TDB data store.

Using only FROM clauses, with no FROM NAMED in a query, leaves the named graphs as all the named graphs in the data store. Using only FROM NAMED, with no FROM in a query, causes an empty default graph to be used. If the symbol TDB.symUnionDefaultGraph is also set, then the default graph is the union of all the named graphs (FROM NAMED) and the graphs already used for the default graph via FROM. urn:x-arq:UnionGraph and urn:x-arq:DefaultGraph explicitly name the union of named graphs (FROM NAMED) and the described default graph (union of FROM) directly.

# Follow a foaf:knows path across both Alice's and Bob's FOAF data,
# where the data is in the data store as named graphs.
BASE <http://example>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?zName
FROM <alice-foaf>
FROM <bob-foaf>
{
    <http://example/Alice#me> foaf:knows ?y .
    ?y foaf:knows ?z .
    ?z foaf:name ?zName .
}


TDB/QuadFilter
(TDB version 0.8.7 and later) This page describes how to filter quads at the lowest level of TDB. It can be used to hide certain quads (triples in named graphs) or triples. The code for the example on this page can be found in the TDB download: src-examples/tdb.examples/ExQuadFilter.java

Filtering quads should be used with care; the performance of the tuple filter callback is critical. See also Dynamic Datasets, which selects only certain specified named graphs for a query.

TDB will call a registered filter on every quad that it retrieves from any of the indexes, both quads (for named graphs) and triples (for the stored default graph). This filter indicates whether to accept or reject the quad or triple. This happens during basic graph pattern processing. A rejected quad is simply not processed further in the basic graph pattern; it is as if it were not in the dataset. The filter has a signature of:
// org.openjena.atlas.iterator.Filter
interface Filter<T>
{
    public boolean accept(T item) ;
}

with a type parameter of Tuple<NodeId>. NodeId is the low-level internal identifier TDB uses for RDF terms. Tuple is a class for immutable tuples of values of the same type.
/** Create a filter to exclude the graph http://example/g2 */
private static Filter<Tuple<NodeId>> createFilter(Dataset ds)
{
    DatasetGraphTDB dsg = (DatasetGraphTDB)(ds.asDatasetGraph()) ;
    NodeTable nodeTable = dsg.getQuadTable().getNodeTupleTable().getNodeTable() ;
    // Filtering operates at a very low level:
    // need to know the internal identifier for the graph name.
    final NodeId target = nodeTable.getNodeIdForNode(Node.createURI("http://example/g2")) ;

    // Filter to accept or reject a quad as being visible.
    Filter<Tuple<NodeId>> filter = new Filter<Tuple<NodeId>>() {
        public boolean accept(Tuple<NodeId> item)
        {
            // Quads are 4-tuples, triples are 3-tuples.
            if ( item.size() == 4 && item.get(0).equals(target) )
                return false ;      // Reject.
            return true ;           // Accept.
        }
    } ;
    return filter ;
}

To install a filter, put it in the context of a query execution under the symbol SystemTDB.symTupleFilter

Dataset ds = ... ;
Filter<Tuple<NodeId>> filter = createFilter(ds) ;
Query query = ... ;
QueryExecution qExec = QueryExecutionFactory.create(query, ds) ;
qExec.getContext().set(SystemTDB.symTupleFilter, filter) ;

then execute the query as normal.

TDB/ValueCanonicalization
TDB canonicalizes certain XSD datatypes: the value of literals of these datatypes is stored, not the original lexical form. For example, "01"^^xsd:integer, "1"^^xsd:integer and "+001"^^xsd:integer are all the same value and are stored as the same RDF literal. In addition, derived types for integers are also understood by TDB; for example, "01"^^xsd:integer and "1"^^xsd:byte are the same value. When RDF terms for these values are returned, the lexical form will be the canonical representation.

Only certain ranges of values are directly encoded as values. If a literal is outside the canonicalization range, its lexical representation is stored. TDB transparently switches between value-based and non-value-based literals in graph matching and filter expressions; non-canonicalized and canonicalized values will be compared as needed. (Future versions of TDB may increase the ranges canonicalized.) The datatypes canonicalized by TDB are:

XSD decimal (canonicalized range: 8 bits of scale, signed 48 bits of value)
XSD integer (canonicalized range: 56 bits)
XSD dateTime (canonicalized range: year 0 to the year 8000, millisecond accuracy, timezone to 15 minutes)
XSD date (canonicalized range: year 0 to the year 8000, timezone to 15 minutes)
XSD boolean (canonicalized range: true and false)
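The effect can be seen with the standard Jena typed literal API (a minimal sketch; an in-memory model illustrates the value equality, while TDB additionally stores only the canonical form):

Model m = ModelFactory.createDefaultModel() ;
Literal a = m.createTypedLiteral("01", XSDDatatype.XSDinteger) ;
Literal b = m.createTypedLiteral("1", XSDDatatype.XSDinteger) ;
// Same value, different lexical forms; TDB would store both as "1".
System.out.println(a.sameValueAs(b)) ;   // true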

TDB/Architecture
This page gives an overview of the TDB architecture. Specific details refer to TDB 0.8.

Contents

1 Terminology
2 Design
  2.1 The Node Table
  2.2 Triple and Quad indexes
  2.3 Prefixes Table
  2.4 TDB B+Trees
3 Inline values
4 Query Processing
5 Caching on 32 and 64 bit Java systems

Terminology

Terms like "table" and "index" are used in this description. They don't directly correspond to concepts in SQL. For example, in SQL terms, there is no triple table; that can be seen as just having indexes for the table or, alternatively, as there being 3 tables, each of which has a primary key, with TDB managing the relationship between them.

Design
A dataset backed by TDB is stored in a single directory in the filing system. A dataset consists of

The node table
Triple and quad indexes
The prefixes table

The Node Table


The node table stores the representation of RDF terms (except for inlined values - see below). It provides two mappings: from Node to NodeId, and from NodeId to Node. This is sometimes called a dictionary.

The Node to NodeId mapping is used during data loading and when converting constant terms in queries from their Jena Node representation to the TDB-specific internal ids. The NodeId to Node mapping is used to turn query results expressed as TDB NodeIds into the Jena Node representation, and also during query processing when filters are applied, if the whole node representation is needed for testing (e.g. regex).

Node table implementations usually provide a large cache - the NodeId to Node mapping is heavily used in query processing, yet the same NodeId can appear in many query results.

NodeIds are 8 byte quantities. The Node to NodeId mapping is based on a hash of the Node (a 128 bit MD5 hash - the length was found not to be a major performance factor). The default storage of the node table is a sequential access file for the NodeId to Node mapping and a B+Tree for the Node to NodeId mapping.

Triple and Quad indexes


Quads are used for named graphs, triples for the default graph. Triples are held as 3-tuples of NodeIds in triple indexes, quads as 4-tuples; otherwise they are handled in the same manner. The triple table is 3 indexes - there is no distinguished triple table with secondary indexes. Instead, each index has all the information about a triple. The default storage of each index is a B+Tree.

Prefixes Table
The prefixes table uses a node table and an index for GPU (Graph->Prefix->URI). It is usually small. It does not take part in query processing. It provides support for Jena's PrefixMappings, used mainly for presentation and for serialisation of triples in RDF/XML or Turtle.

TDB B+Trees
Many of the persistent data structures in TDB use a custom implementation of threaded B+Trees. The TDB implementation only provides for fixed-length keys and fixed-length values; there is no use of the value part in triple indexes. The threaded nature means that long scans of indexes proceed without needing to traverse the branches of the tree.

See the description of index caching below.

Inline values
Values of certain datatypes are held as part of the NodeId in the bottom 56 bits; the top 8 bits indicate whether the NodeId is a reference to the node table or which value space it encodes. The value spaces handled are (TDB 0.8): xsd:decimal, xsd:integer, xsd:dateTime, xsd:date and xsd:boolean. Each has its own encoding to fit in 56 bits. If a node falls outside of the range of values that can be represented in the 56 bit encoding, it is stored in the node table instead.

The xsd:dateTime and xsd:date ranges cover about 8000 years from year zero, with a precision down to 1 millisecond. Timezone information is retained to an accuracy of 15 minutes, with special timezones for Z and for no explicit timezone.

By storing the value, the exact lexical form is not recorded. The integers 01 and 1 will both be treated as the value 1. Derived XSD datatypes are held as their base type; the exact datatype is not retained, the value of the RDF term is.
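As an illustration of the encoding just described (a sketch only; these constants and helpers are invented for the example and are not TDB's actual code):

// 64-bit NodeId: top 8 bits select the type (node table reference or
// value space), bottom 56 bits hold the inline value or table offset.
static final long VALUE_MASK = (1L << 56) - 1 ;

static long makeNodeId(int type, long value56)
{ return ((long)type << 56) | (value56 & VALUE_MASK) ; }

static int typeOf(long nodeId)   { return (int)(nodeId >>> 56) ; }
static long valueOf(long nodeId) { return nodeId & VALUE_MASK ; }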

Query Processing
TDB uses the OpExecutor extension point of ARQ. TDB provides low level optimization of basic graph patterns using a statistics based optimizer.

Caching on 32 and 64 bit Java systems


TDB runs on both 32-bit and 64-bit Java Virtual Machines. The same file formats are used on both systems, and database files can be transferred between architectures (no TDB system should be running for the database at the time of copy). What differs is the file access mechanism used. TDB is faster on a 64 bit JVM because more memory is available for file caching. The node table caches are always in the Java heap.

The file access mechanism can be set explicitly, but this is not a good idea for production usage, only for experimentation - see the File Access mode option.

On 64-bit Java, TDB uses memory mapped files, accessed in 8Mbyte segments, and the operating system handles caching between RAM and disk. The amount of RAM used for file caching increases and decreases as other applications run on the machine; the fewer other programs running, the more RAM will be available for file caching. The mapped address space counts as part of the application process's memory usage, but this space is not part of the Java heap.

On a 32 bit JVM, this approach does not work, because Java addressing is limited to about 1.5Gbytes (the exact figure is JVM-specific and includes any memory mapped file usage), and this would limit the size of TDB datasets. Instead, TDB provides an in-heap LRU cache of B+Tree blocks. Applications should set the JVM heap to 1Gbyte or above (within the JVM-specific limit) to make space for the disk caches used by TDB.

TDB/Optimizer

Query execution in TDB involves both static and dynamic optimizations. Static optimizations are transformations of the SPARQL algebra performed before query execution begins; dynamic optimizations involve deciding the best execution approach during the execution phase, and can take into account the actual data so far retrieved.

The optimizer has a number of strategies: a statistics based strategy, a fixed strategy and a strategy of no reordering. For the preferred statistics strategy, the TDB optimizer uses information captured in a per-database statistics file. The file takes the form of a number of rules for approximate matching counts for triple patterns. The statistics file can be generated automatically. The user can add and modify rules to tune the database based on higher level knowledge, such as inverse function properties.

Contents

1 Quickstart
2 Choosing the optimizer strategy
3 Filter placement
4 Investigating what is going on
5 Statistics Rule File
  5.1 Statistics Rule Language
  5.2 Abbreviated Rule Form
  5.3 Defaults
6 Generating a statistics file
7 Writing Rules

Feedback on the effects, good and bad, of the TDB optimizer would be appreciated. Mail to jena-dev.

Quickstart
This section provides a practical how-to.

1. Load data.
2. Generate the statistics file: run tdbconfig stats.
3. Place the file generated in the database directory with the name stats.opt.

Choosing the optimizer strategy


TDB chooses the basic graph pattern optimizer by the presence of a file in the database directory.

Optimizer control files:

none.opt : No reordering - execute triple patterns in the order given in the query.
fixed.opt : Use a built-in reordering based on the number of variables in a triple pattern.
stats.opt : Use the statistics strategy; the contents of this file are the weighting rules (see below).

The contents of the files none.opt and fixed.opt are not read and don't matter. They can be zero-length files.

If more than one file is found, the choice is made in the order: stats.opt over fixed.opt over none.opt. The "no reorder" strategy can be useful in investigating the effects of optimization. Filter placement still takes place.

Filter placement
One of the key optimizations is of filtered basic graph patterns. This optimization decides the best order of triple patterns in a basic graph pattern, and also the best point at which to apply the filters within the triple patterns. Any filter expression of a basic graph pattern is placed immediately after the point at which all its variables will be bound. Conjunctions at the top level in filter expressions are broken into their constituent pieces and placed separately.

Investigating what is going on


TDB can optionally log query execution details. This is controlled by two settings: the logging level and a context setting. Having two settings means it is possible to log some queries and not others. The logger used is called com.hp.hpl.jena.tdb.exec. Messages are sent at level "info", so for log4j, the following can be set in the log4j.properties file:
log4j.logger.com.hp.hpl.jena.tdb.exec=INFO

The context setting is for key (Java constant) TDB.symLogExec. To set globally:
TDB.getContext().set(TDB.symLogExec,true) ;

and it may also be set on an individual query execution using its local context:
QueryExecution qExec = QueryExecutionFactory.create(...) ;
qExec.getContext().set(TDB.symLogExec, true) ;

On the command line:


tdbquery --set tdb:logExec=true --file queryfile

TDB version 0.8.3 provides more fine-grained logging controls. Instead of "true", which sets all levels, the following can be used:

Explanation Levels:

INFO : Log each query.
FINE : Log each query and its algebra form after optimization.
ALL : Log query, algebra and every database access (can be expensive).
NONE : No information logged.

These can be specified as strings to the command line tools, or using the constants in Explain.InfoLevel:
qExec.getContext().set(TDB.symLogExec,Explain.InfoLevel.FINE) ;

Statistics Rule File


The syntax is SSE, a simple format that uses Turtle syntax for RDF terms and keywords for other terms (for example, the keyword stats marks a statistics data structure), and which forms a tree data structure. The structure of a statistics file takes the form:
(prefix ...
  (stats
    (meta ...)
    rule
    rule
  ))

that is, a meta block and a number of pattern rules. A simple example:
(prefix ((: <http://example/>))
  (stats
    (meta
      (timestamp "2008-10-23T10:35:19.122+01:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
      (run@ "2008/10/23 10:35:19")
      (count 11))
    (:p 7)
    (<http://example/q> 7)
  ))

This example statistics file contains some metadata about the statistics (the time and date the file was generated, the size of the graph), and the frequency counts for two predicates: http://example/p (written using a prefixed name) and http://example/q (written in full). The numbers are the estimated counts. They do not have to be exact - they guide the optimizer in choosing one execution plan over another. They do not have to be exactly up-to-date, provided the relative counts are representative of the data.

Statistics Rule Language

A rule is made up of a triple pattern and a count estimation for the approximate number of matches that the pattern will yield. This does not have to be exact, only an indication. In addition, the optimizer considers which variables will be bound to RDF terms by the time a triple pattern is reached in the execution plan being considered. For example, in the basic graph pattern:
{ ?x :identifier 1234 .
  ?x :name ?name . }

then ?x will be bound to an RDF term in the pattern ?x :name ?name if it is executed after the pattern ?x :identifier 1234. A rule is of the form:
( (subj pred obj) count)

where subj, pred, obj are either RDF terms or one of the tokens in the following table:

Statistic rule tokens:

TERM : Matches any RDF term (URI, Literal, Blank node)
VAR : Matches a named variable (e.g. ?x)
URI : Matches a URI
LITERAL : Matches an RDF literal
BNODE : Matches an RDF blank node (in the data)
ANY : Matches anything - a term or variable

From the example above, (VAR :identifier TERM) will match ?x :identifier 1234. (TERM :name VAR) will match ?x :name ?name in a potential plan where the :identifier triple pattern is first, because ?x will be a bound term at that point, but not if this triple pattern is considered first.

When searching for a weighting of a triple pattern, the first rule to match is taken. The rule which says an RDF graph is a set of triples:

((TERM TERM TERM) 1)

is always implicitly present. BNODE does not match a blank node in the query (which is a variable, and matches VAR), but does match in the data, if it is known that that slot of a triple pattern is a blank node.

Abbreviated Rule Form

While a complete rule is of the form:


( (subj pred obj) count)

there is an abbreviated form:


(predicate count)

The abbreviated form is equivalent to writing:


((TERM predicate ANY) X)
((ANY predicate TERM) Y)
((ANY predicate ANY) count)

where, for small graphs (less than 100 triples), X=2 and Y=4 (Y=40 if the predicate is rdf:type), and 2, 10 and 1000 respectively for large graphs. Use of "VAR rdf:type Class" can be a quite unselective triple pattern, so there is a preference to move it later in the order of execution, to allow more selective patterns to reduce the set of possibilities first. The astute reader may notice that ontological information may render it unnecessary (the domain or range of another property implies the class of some resource); TDB does not currently perform this optimization. These numbers are merely convenient guesses, and the application can use the full rules language for detailed control of pattern weightings.

Defaults

A rule of the form:


(other number)

is used when no other rule (abbreviated or full) matches a triple pattern that has a URI in the predicate position. If a rule of this form is absent, the default is to place the triple pattern after all known triple patterns; this is the same as specifying -1 as the number. To declare that the rules are complete and no other predicates occur in the data, set this to 0 (zero), because the triple pattern then cannot match the data (the predicate does not occur).

Generating a statistics file


The command line tool tdbstats will scan the data and produce a rules file based on the frequency of properties. The output should first go to a temporary file, then that file moved into the database location. Practical tip: don't feed the output of this command directly to location/stats.opt, because when the command starts it will find an empty statistics file at that location.

Writing Rules
Rule for an inverse functional property:
((VAR :ifp TERM) 1 )

and even if a property is only approximately identifying for resources (e.g. date of birth in a small dataset of people), it is useful to indicate this. Because the counts needed are only approximations to let the optimizer choose one order over another, and do not need to predict exact counts, rules that are usually right but may be slightly wrong are still useful overall. Rules involving rdf:type can be useful where they indicate whether a particular class is common or not. In some datasets
((VAR rdf:type class) ...)

may help little, because a property whose domain is that class, or a subclass, may be more selective. So a rule like:
((VAR :property VAR) ...)

is more selective. In other datasets, there may be many classes, each with a small number of instances, in which case
((VAR rdf:type class) ...)

is a useful selective rule.

TDB/Joseki Integration
Joseki uses an RDF dataset description. Using a TDB graph in a Joseki server instance is a matter of putting the graph in a dataset, as in the example below, where it is the default graph of the dataset. Full assembler details are on the TDB assembler description page. A simple example that publishes one TDB-backed dataset at a SPARQL endpoint:

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .

tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
tdb:GraphTDB   rdfs:subClassOf ja:Model .

<#dataset> rdf:type tdb:DatasetTDB ;
    tdb:location "DB" ;
    .


SDB
SDB is a component of Jena for RDF storage and query, specifically to support SPARQL. The storage is provided by an SQL database, and many databases are supported, both open source and proprietary. An SDB store can be accessed and managed with the provided command line scripts and via the Jena API.

Contents

1 Downloads
2 Documentation
3 Subversion
4 Support
5 Details
6 Database Notes

Downloads
SDB is distributed from the Jena project on SourceForge (see the SDB download area).

Documentation

SDB Installation
Quickstart
Command line utilities
Store Description format
Dataset And Model Descriptions
Use from Java
Specialized configuration
Database Layouts
FAQ
Joseki Integration
Databases supported

Subversion
SDB Subversion repository on SourceForge.

Support
Support and questions

Details

Loading data
Loading performance
Query
Query performance

Database Notes

Databases supported
PostgreSQL
MySQL
Oracle
Microsoft SQL Server
DB2
Derby
HSQLDB
H2

SDB/Installation
A suitable database must be installed separately. Any database installation should be tuned according to the database documentation.

The SDB distribution is a zip file of a directory hierarchy. Unzip it. You may need to run chmod u+x on the scripts in the bin/ directory.

Write an sdb.ttl store description; there are examples in the Store/ directory. A database must be created before the tests can be run. Microsoft SQL Server and PostgreSQL need specific database options set when a database is created.

To use SDB in a Java application, put all the jar files in lib/ on the build path and classpath of your application. See the Java API. To use the command line scripts, see the scripts page, including setting the environment variables SDBROOT, SDB_USER, SDB_PASSWORD and SDB_JDBC.

Create the database:
bin/sdbconfig --sdb=sdb.ttl --create

and run the test suite:


bin/sdbtest --sdb=sdb.ttl testing/manifest-sdb.ttl

SDB/Quickstart
SDB provides some command line tools to work with SDB triple stores. In the following, it is assumed that you have a store description set up for your database (sdb.ttl); see the store description format for details, and the Store/ directory for some examples.

Setting up your environment

$ export SDBROOT=/path/to/sdb
$ export PATH=$SDBROOT/bin:$PATH
$ export SDB_USER=YourDatabaseUserName
$ export SDB_PASSWORD=YourDatabasePassword
$ export SDB_JDBC=YourJDBCdriver

Initialising the database

Be aware that this will wipe existing data from the database.
$ sdbconfig --sdb sdb.ttl --format

This creates a basic layout. It does not add all indexes to the triple table, which may be left until after loading.

Loading data

$ sdbload --sdb sdb.ttl file.rdf

You might want to add the --verbose flag to show the load as it progresses.

Adding indexes

You need to do this at some point if you want your queries to execute in a reasonable time.
$ sdbconfig --sdb sdb.ttl --index

Query

$ sdbquery --sdb sdb.ttl 'SELECT * WHERE { ?s a ?p }' $ sdbquery --sdb sdb.ttl --file query.rq

SDB/Commands
This page describes the command line programs that can be used to create an SDB store, load data into it and issue queries.

Contents

1 Scripts
  1.1 Script set up
  1.2 Argument Structure
2 Store Description
  2.1 Modifying the Store Description
  2.2 Logging and Monitoring
3 SDB Commands
  3.1 Database creation
  3.2 Loading data
  3.3 Query
  3.4 Testing
  3.5 Other

Scripts
The directory bin/ contains shell scripts to run the commands from the command line.

Script set up

Set the environment variable SDBROOT to the root of the SDB installation. A store description can include naming the class for the JDBC driver; getting a Store object from a store description will automatically load the JDBC driver from the classpath. When running scripts, set the environment variable SDB_JDBC to one or more jar files for JDBC drivers (if it is more than one jar file, use the classpath syntax for your system). You can also use the system property jdbc.drivers. Set the environment variables SDB_USER and SDB_PASSWORD to the database user name and password for JDBC.
$ export SDBROOT="/path/to/sdb"
$ export SDB_USER="YourDbUserName"
$ export SDB_PASSWORD="YourDbPassword"
$ export SDB_JDBC="/path/to/driver.jar"

They are bash scripts, and work on Linux and Cygwin for MS Windows.
$ export PATH=$SDBROOT/bin:$PATH

Alternatively, there are wrapper scripts in $SDBROOT/bin2 which can be placed in a convenient directory that is already on the shell command path.

Argument Structure

All commands take an SDB store description to extract the connection and configuration information they need. This is written SPEC in the command descriptions below, but it can be composed of several arguments, as described here. Each command then has command-specific arguments, described below. All commands support --help to give details of named and positional arguments. There are two equivalent forms of named argument syntax:
--arg=val --arg val

Store Description
If this is not specified, commands load the description file sdb.ttl from the current directory.
--sdb=<sdb.ttl>

This store description is a Jena assembler file. The description consists of two parts: a store description and a connection description. Often, this is all that is needed to describe which store to use. The individual components of a connection or configuration can be overridden after the description has been read, before it is processed. The directory Store/ has example assembler files. The full details of the assembler file are given in 'SDB/Store Description'.

Modifying the Store Description

The individual items of a store description can be overridden by various command arguments. The description in the assembler file is read, then any command line arguments are used to modify the description, then the appropriate object is created from the modified description. Set the layout type:
--layout : layout name

Currently, one of layout1, layout2, layout2/index, layout2/hash. Set JDBC details:


--dbName : Database name
--dbHost : Host machine name
--dbType : Database type
--dbUser : Database user
--dbPassword : Database password

The host name can be host or host:port. The better way to handle passwords is to use the environment variables SDB_USER and SDB_PASSWORD, because then the user/password is not stored in a visible way.

Logging and Monitoring

All commands take the following arguments (although they may do nothing if they make no sense to the command).
-v

Be verbose.
--time

Print timing information. Treat with care: while the timer avoids recording JVM and some class loading time, it can't avoid all class loading, so the values are more meaningful on longer operations. JDBC operation times to a remote server can also be a significant proportion of short operations.
--log=[all|none|queries|statements|exceptions]

to log SQL actions on the database connection (but not the prepared statements used by the loader). Can be repeated on the command line.

SDB Commands
Database creation
sdbconfig SPEC [--create|--format|--indexes|--dropIndexes]

Set up a database.

--create : Formats the store and sets up the indexes.
--format : Just formats the store, creating the indexes needed for loading but not those for querying.
--indexes : Create the indexes for querying.
--dropIndexes : Drop the indexes for querying.

Loading large graphs can be faster by formatting, loading the data, then building the query indexes with this command.
sdbtruncate SPEC

Truncate the store. Non-transactional. Destroys data.

Loading data


sdbload SPEC FILE [FILE ...]

Load RDF data into a store using the SDB bulk loader. Data is streamed into the database and is not loaded as a single transaction. The file's extension is used to determine the data syntax. To load into a named graph:
sdbload SPEC --graph=URI FILE [FILE ...]

Query
sdbquery SPEC --query=FILE

Execute a query.
sdbprint SPEC --print=X [--sql] --query=FILE

Print details of a query. X is any of query, op, sqlNode, sql or plan. --print=X can be repeated. --sql is short for --print=sql, which is also the default.

Testing
sdbtest SPEC MANIFEST

Execute a test manifest file. The manifest of all query tests, which will test connection and loading of data, is in SDBROOT/testing/manifest-sdb.ttl.

Other
sdbdump SPEC --out=SYNTAX

Dump the contents of a store in N-TRIPLES or a given serialization format (usual Jena syntax names, e.g. Turtle or TTL).

Only suitable for data sizes that fit in memory. All output syntaxes that do some form of pretty printing will need additional space for their internal data structures.
sdbsql SPEC [ --file=FILE | SQL string ]

Execute a SQL command on the store, using the connection details from the store specification. The SQL command either comes from file FILE or the command line as a string.
sdbinfo SPEC

Details of a store.
sdbmeta SPEC --out=SYNTAX

Do things with the meta graphs of a store.


sdbscript SPEC FILE

Execute a script. Currently only JRuby is supported.


sdbtuple SPEC [--create|--print|--drop|--truncate] tableName

Many of the tables used within SDB are tuples of RDF nodes. This command allows low-level access to these tuple tables. Misuse of this command can corrupt the store.

SDB/Store Description
Use of an SDB store requires a Store object, which is described in two parts:

a connection to the database
a description of the store configuration

These can be built from a Jena assembler description. Store objects themselves are lightweight so connections to an SDB database can be created on a per-request basis as required for use in J2EE application servers.

Contents

1 Store Descriptions
2 SDB Connections
3 Example
4 Vocabulary
  4.1 Store
  4.2 Connection

Store Descriptions
A store description identifies which storage layout is being used, the connection to use and the database type.
[] rdf:type sdb:Store ;
    sdb:layout "layout2" ;
    sdb:connection <#conn> .

<#conn> ...

SDB Connections
SDB connections, objects of class SDBConnection, abstract away from the details of the connection and also provide consistent logging and transaction operations. Currently, SDB connections encapsulate JDBC connections, but other connection technologies, such as direct database APIs, can be added.

Example
The sdbType is needed for both a connection and for a store description. It can be given in either part of the complete store description. If it is specified in both places, it must be the same.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:   <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix sdb:  <http://jena.hpl.hp.com/2007/sdb#> .

<#myStore> rdf:type sdb:Store ;
    sdb:layout "layout2" ;
    sdb:connection <#conn> ;
    .

<#conn> rdf:type sdb:SDBConnection ;
    sdb:sdbType "derby" ;
    sdb:sdbName "DB/SDB2" ;
    sdb:driver "org.apache.derby.jdbc.EmbeddedDriver" ;
    .

Examples of assembler files are to be found in the Store/ directory in the distribution.

Vocabulary
Store

The value of sdbType needed for the connection also applies to choosing the store type.

layout : Layout type (e.g. "layout2", "layout2/hash" or "layout2/index").
connection : The object of this triple is the subject of the connection description.
engine : Set the MySQL engine type (MySQL only).

Connection

sdbType : The type of the database (e.g. "oracle", "MSSQLServerExpress", "postgresql", "mysql"). Controls both creating the JDBC URL, if not given explicitly, and the store type.
sdbName : Name used by the database service to select a database. Oracle SID.
sdbHost : Host name for the database server. Include :port to change the port from the default for the database.
sdbUser, sdbPassword : Database user name and password. The environment variables SDB_USER and SDB_PASSWORD are a better way to pass in the user and password, because they are then not written into store description files. In Java programs, the system properties jena.db.user and jena.db.password can be used.
driver : The JDBC driver class name. Normally, the system looks up the sdbType to find the driver; setting this property overrides that choice.
jdbcURL : If necessary, the JDBC URL can be set explicitly, rather than constructed by SDB. The sdbType is still needed.


SDB/JavaAPI
This page describes how to use SDB from Java. Code examples are in src-examples/ in the SDB distribution.

Contents

1 Concepts
2 Obtaining the Store
  2.1 From a configuration file
  2.2 In Java code
  2.3 Database User and Password
3 Connection Management
4 Formatting or Emptying the Store
5 Loading data
6 Executing Queries
7 Using the Jena Model API with SDB

Concepts

Store : SDB loads and queries data based on the unit of a Store. The Store object has all the information for formatting, loading and accessing an SDB database. One database or table space is one Store. Store objects are made via the static methods of the StoreFactory class.

SDBConnection : Wraps the underlying database connection, as well as providing logging operations.

StoreDesc : A store description is the low level mechanism for describing stores to be created.

DatasetStore, GraphSDB : Two further classes are not immediately visible, because they are managed by the SDBFactory, which creates the necessary classes, such as Jena models and graphs. An object of class DatasetStore represents an RDF dataset backed by an SDB store; objects of this class trigger SPARQL queries being sent to SDB. The class GraphSDB provides the adapter between the standard Jena Java API and an SDB store, either to the default graph or one of the named graphs. The SDBFactory can also create Jena Models backed by such a graph.
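A short sketch of how these classes combine (SDBFactory.connectDefaultModel is assumed here as the method that returns a model over the store's default graph):

Store store = SDBFactory.connectStore("sdb.ttl") ;
// A Jena model backed by the default graph of the store (a GraphSDB underneath).
Model model = SDBFactory.connectDefaultModel(store) ;
...                                // use the standard Jena Model API
store.close() ;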

Obtaining the Store


A store is built from a description. This can be a description in a file, as a Jena assembler, or the application can build the store description programmatically.

From a configuration file

The store description is the only point where the specific details of the store are given, including connection information, the database name, and the database type. It makes sense to place this outside the code: the application can then be switched between different databases (e.g. testing and production) by changing a configuration file, not the code, which would require recompilation and a rebuild. To create a Store from a store assembler file:
Store store = SDBFactory.connectStore("sdb.ttl") ;

The assembler file has two parts, the connection details and the store type.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:   <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix sdb:  <http://jena.hpl.hp.com/2007/sdb#> .

_:c rdf:type sdb:SDBConnection ;
    sdb:sdbType "derby" ;
    sdb:sdbName "DB/SDB2" ;
    sdb:driver "org.apache.derby.jdbc.EmbeddedDriver" ;
    .

[] rdf:type sdb:Store ;
    sdb:layout "layout2" ;
    sdb:connection _:c ;
    .

See the full details of store description files for the options.

In Java code

The less flexible way to create a store description is to build it in Java. For example:
StoreDesc storeDesc = new StoreDesc(LayoutType.LayoutTripleNodesHash, DatabaseType.Derby) ;
JDBC.loadDriverDerby() ;
String jdbcURL = "jdbc:derby:DB/SDB2" ;
SDBConnection conn = new SDBConnection(jdbcURL, null, null) ;
Store store = SDBFactory.connectStore(conn, storeDesc) ;

Database User and Password

The user and password for the database can be set explicitly in the description file, but it is usually better to use an environment variable or Java system property because this avoids writing the user and password into a file.

Environment variable: SDB_USER / Java property: jena.db.user
Environment variable: SDB_PASSWORD / Java property: jena.db.password
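For example, a minimal sketch of the system property route, set before the store is connected (the credential values and the file name "sdb.ttl" are placeholders):

// Sketch: uses the property names given above; "dbuser"/"dbpassword" are placeholders.
System.setProperty("jena.db.user", "dbuser") ;
System.setProperty("jena.db.password", "dbpassword") ;
Store store = SDBFactory.connectStore("sdb.ttl") ;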

Connection Management
Each store has a JDBC connection associated with it.

In situations where such connections are managed externally, the store object can be created and used within a single operation. A Store is lightweight and does not perform any database actions when created, so creating and releasing stores will not impact performance. Closing a store does not close the JDBC connection. Similarly, an SDBConnection is lightweight and its creation does not result in any database or JDBC connection actions. The store description can be read from the same file because any SDB connection information is ignored when reading just the store description. The store description can be kept across store creations:
storeDesc = StoreDesc.read("sdb.ttl") ;

then used with a JDBC connection object passed in from the connection container:
public static void query(String queryString, StoreDesc storeDesc, Connection jdbcConnection)
{
    Query query = QueryFactory.create(queryString) ;
    SDBConnection conn = SDBFactory.createConnection(jdbcConnection) ;
    Store store = SDBFactory.connectStore(conn, storeDesc) ;
    Dataset ds = SDBFactory.connectDataset(store) ;
    QueryExecution qe = QueryExecutionFactory.create(query, ds) ;
    try {
        ResultSet rs = qe.execSelect() ;
        ResultSetFormatter.out(rs) ;
    } finally { qe.close() ; }
    store.close() ;
}

Formatting or Emptying the Store


SDB does not automatically format the database for use as a store. You can check whether the store is already formatted using:
StoreUtils.isFormatted(store);

This is an expensive operation and should be used sparingly. The first time you obtain a store, you will need to format it:
store.getTableFormatter().create();

This will create the necessary tables and indexes required for a full SDB store. You may empty the store completely using:
store.getTableFormatter().truncate();
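Putting these together, a typical first-use sketch (assuming the store description file "sdb.ttl" used earlier; any checked exception handling is omitted):

// Sketch: format the store only if it has not been formatted before.
Store store = SDBFactory.connectStore("sdb.ttl") ;
if ( ! StoreUtils.isFormatted(store) )
    store.getTableFormatter().create() ;   // create the tables and indexes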

Loading data
Data loading uses the standard Jena Model.read operations. GraphSDB, and models made from a GraphSDB, implement the standard Jena bulk data interface, backed by an SDB implementation of that interface.
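For example, a sketch of loading a file through an SDB-backed model (the file name is a placeholder; store is assumed to be obtained as above):

Model model = SDBFactory.connectDefaultModel(store) ;
// The whole read is passed through the SDB bulk loader.
model.read("file:data.ttl", "TURTLE") ;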

Executing Queries
The interface for making queries with SDB is the same as that for querying with ARQ. SDB is an ARQ query engine that handles queries made on an RDF dataset of the SDB class DatasetStore:
Dataset ds = DatasetStore.create(store) ;

This is then used as normal with ARQ:


Dataset ds = DatasetStore.create(store) ;
QueryExecution qe = QueryExecutionFactory.create(query, ds) ;
try {
    ResultSet rs = qe.execSelect() ;
    ResultSetFormatter.out(rs) ;
} finally { qe.close() ; }

When finished, the store should be closed to release any resources associated with the particular implementation. Closing a store does not close its JDBC connection.
store.close() ;

Closing the SDBConnection does close the JDBC connection:


store.getConnection().close() ;
store.close() ;

If models or graphs backed by SDB are placed in a general-purpose Dataset, then queries over that dataset are not executed efficiently by SDB.

Using the Jena Model API with SDB


A Jena model can be connected to one graph in the store and used with all the Jena API operations. Here, the graph for the model is the default graph:
Store store = SDBFactory.connectStore("sdb.ttl") ; Model model = SDBFactory.connectDefaultModel(store) ; StmtIterator sIter = model.listStatements() ; for ( ; sIter.hasNext() ; ) { Statement stmt = sIter.nextStatement() ; System.out.println(stmt) ; } sIter.close() ; store.close() ;

SDB is optimized for SPARQL queries, but queries and other Jena API operations can be mixed. The results from a SPARQL query are Jena RDFNodes, with the associated model having a graph implemented by SDB.
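A sketch of mixing the two, assuming a SELECT query that binds ?x to resources (store and query obtained as in the earlier examples):

Dataset ds = SDBFactory.connectDataset(store) ;
QueryExecution qe = QueryExecutionFactory.create(query, ds) ;
try {
    ResultSet rs = qe.execSelect() ;
    while ( rs.hasNext() ) {
        Resource r = rs.nextSolution().getResource("x") ;
        // A Jena API call on a query result; this reads from the SDB store.
        StmtIterator it = r.listProperties() ;
        while ( it.hasNext() )
            System.out.println(it.nextStatement()) ;
    }
} finally { qe.close() ; }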

SDB/Configuration
This page describes the configuration options available. These are options for query processing, not for the database layout and storage, which is controlled by store descriptions.

Contents

1 Setting Options
2 Current Options
  2.1 Queries over all Named Graphs
  2.2 Streaming over JDBC
  2.3 Annotated SQL

Setting Options
Options can be set globally, throughout the JVM, or on a per query execution basis. SDB uses the same mechanism as ARQ: there is a global context, which is given to each query execution as it is created. Modifications to the global context after the query execution is created are not seen by that query execution. Modifications to the context of a single query execution do not affect any other query execution nor the global context. A context is a set of symbol/value pairs. Symbols are created internally by ARQ and SDB and accessed via Java constants. Values are any Java object, together with the values true and false, which are short for the constants of class java.lang.Boolean. Setting globally:
SDB.getContext().set(symbol, value) ;

Per query execution:


QueryExecution qExec = QueryExecutionFactory.create(...) ;
qExec.getContext().set(symbol, value) ;

Settings for a query execution must be made before any query compilation or setup happens. Creating a query execution object does not compile the query; that happens when the appropriate .exec method is called.
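For example, to enable the union default graph option (described in the table below) for a single query execution, a sketch (query and ds assumed to exist, as in the examples above):

QueryExecution qExec = QueryExecutionFactory.create(query, ds) ;
qExec.getContext().set(SDB.unionDefaultGraph, true) ;   // set before any .exec call
ResultSet rs = qExec.execSelect() ;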

Current Options
Symbol                     Effect                                                     Default

SDB.unionDefaultGraph      Query patterns on the default graph match against         false
                           the union of the named graphs.
SDB.jdbcStream             Attempt to stream JDBC results.                            true
SDB.jdbcFetchSize          Set the JDBC fetch size on the SQL query statements.      unset
                           Must be >= 0.
SDB.streamGraphAPI         Stream the Jena graph API (also requires jdbcStream       false
                           and jdbcFetchSize).
SDB.annotateGeneratedSQL   Put SQL comments in the generated SQL.                     true

Queries over all Named Graphs (experimental feature)

All the named graphs can be treated as a single graph in two ways: either set the SDB option above, or use the URI that refers to the RDF merge of the named graphs (urn:x-arq:UnionGraph). When querying the RDF merge of the named graphs, the default graph in the store is not included. This feature applies to queries only; it does not affect storage, nor does it change loading. To find out which named graph a triple can be found in, use GRAPH as usual.

The following special IRIs exist for use as a graph name in GRAPH only:

<urn:x-arq:DefaultGraph>: the default graph, even when the option for named union queries is set.
<urn:x-arq:UnionGraph>: the union of all named graphs, even when the option for named union queries is not set.

Streaming over JDBC

By default, SDB processes results from SQL statements in a streaming fashion. It is important to close query execution objects, especially if not consuming all the results, because that causes the underlying JDBC result set to be closed. JDBC result sets can be a scarce system resource. If this option is set but the JDBC connection is not streaming, the feature is harmless. Setting it false causes SDB to read all results of an SQL statement at once, treating streamed connections as unstreamed. Note that results are only streamed end-to-end if the underlying JDBC connection itself is set up to stream; most are not in the default configuration, to reduce transient resource peaks on the server under load. Setting the fetch size enables cursors in some databases, but there may be restrictions imposed by the database; see the documentation for your database for details. In addition, operations on the graph API can be made streaming by also setting the graph API option to streaming.

Annotated SQL

SQL generation can include SQL comments to show how SPARQL has been turned into SQL. This option is true by default and is always set for the command sdbprint.
SDB.getContext().setFalse(SDB.annotateGeneratedSQL) ;

SDB/Database Layouts
SDB does not have a single database layout. This page is an informal overview of the two main types ("layout2/hash" and "layout2/index"). In SDB, one store is one RDF dataset is one SQL database. Databases of type layout2 have a triples table for the default graph and a quads table for the named graphs. In the triples and quads tables, the columns are integers referencing a nodes table. In the hash form, the integers are 8-byte hashes of the node. In the index form, the integers are 4-byte sequence ids into the node table.

Triples
+-----------+
| S | P | O |
+-----------+

Primary key: SPO
Indexes: PO, OS

Quads
+---------------+
| G | S | P | O |
+---------------+

Primary key: GSPO
Indexes: GPO, GOS, SPO, OS, PO

Nodes

In the index-based layout, the table is:
+------------------------------------------------+
| Id | Hash | lex | lang | datatype | value type |
+------------------------------------------------+

Primary key: Id
Index: Hash

In the hash-based layout, the table is:


+-------------------------------------------+
| Hash | lex | lang | datatype | value type |
+-------------------------------------------+

Primary key: Hash

All character fields are Unicode, supporting any character set, including mixed language use.

SDB/FAQ
Tune your database

Database performance depends on the database being tuned. Some databases default to a "developer setup" which does not use much of the RAM and is only suitable for functional testing.

Improving loading rates

For a large bulk load into an existing store, dropping the indexes, doing the load, and then recreating the indexes can be noticeably faster.
sdbconfig --drop
sdbload file
sdbconfig --index

For a large bulk load into a new store, it can be noticeably faster to format the store without creating the indexes, do the load, and then create the indexes.
sdbconfig --format
sdbload --time file
sdbconfig --index

SDB/Joseki Integration
Joseki is a server that implements the SPARQL protocol for HTTP. It can be used to give a SPARQL interface to an SDB installation. The Joseki server needs the SDB jar files on its classpath. The Joseki configuration file needs to contain two triples to integrate SDB:
## Initialize SDB.
[] ja:loadClass "com.hp.hpl.jena.sdb.SDB" .

## Declare that sdb:DatasetStore is an implementation of ja:RDFDataset .
sdb:DatasetStore rdfs:subClassOf ja:RDFDataset .

then a Joseki service can use an SDB-implemented dataset:


<#books> rdf:type sdb:DatasetStore ;
    sdb:store <#store> .

<#store> rdf:type sdb:Store ;
    rdfs:label "SDB" ;
    sdb:layout "layout2" ;
    sdb:connection
        [ rdf:type sdb:SDBConnection ;
          sdb:sdbType "postgresql" ;
          sdb:sdbHost "localhost" ;
          sdb:sdbName "SDB" ;
        ] .

To enable pooling of connections to the SDB store, use the joseki:poolSize property. This causes Joseki to create a pool of SDB datasets, each with its own JDBC connection. This requires Joseki 3.2.
<#sdb> rdf:type sdb:DatasetStore ;
    joseki:poolSize 5 ;      # Number of concurrent connections allowed to this dataset.
    sdb:store <#store> .

SDB 1.0:
## Dataset in SDB.
<#books> rdf:type sdb:DatasetStore , ja:RDFDataset ;
    rdfs:label "Books" ;
    sdb:layout "layout2" ;
    sdb:connection
        [ rdf:type sdb:SDBConnection ;
          sdb:sdbType "postgresql" ;
          sdb:sdbHost "localhost" ;
          sdb:sdbName "SDB" ;
        ] .

The database installation does not need to accept public requests; it needs only to be accessible to the Joseki server itself. There is an example configuration file for a Joseki server using SDB in the Joseki distribution.

SDB/Databases Supported
Supported Databases

Database                    Version      Notes
Oracle 10g                               Including OracleXE
Microsoft SQL Server 2005                Including MS SQL Express
DB2 9                                    Including DB2 9 Express
PostgreSQL                  v8.2
MySQL                       v5.0.22
Apache Derby                v10.2.2.0
H2                          1.0.71
HSQLDB                      1.8.0
Support for a version implies support for later versions unless otherwise stated. Microsoft SQL Server 2000 is also reported to work. H2 support was contributed by Martin Hein (March 2008). IBM DB2 support was contributed by Venkat Krishnamurthy (October 2007). Please report earlier versions that also work.

SDB/NotesPostgreSQL

PostgreSQL-specific notes.

Contents

1 Databases must use UTF-8 encoding
2 Improving loading rates
3 Tuning

Databases must use UTF-8 encoding


Create SDB stores with encoding UTF-8; otherwise international character sets can corrupt the database, and the database will not pass the SDB test suite.

Set this when creating the database with pgAdmin or if you use the command line, for example:
CREATE DATABASE "YourStoreName" WITH OWNER = "user" ENCODING = 'UTF8' TABLESPACE = pg_default;

Improving loading rates


The index layout ("layout2/index") usually loads faster than the hash form.

Existing store

When loading into an existing store, where there is existing data and ANALYZE has been run, the process is:

Drop indexes

sdbconfig --drop

Load data

sdbload file

Redo the indexes

sdbconfig --index

Fresh store

PostgreSQL needs statistics to achieve good load performance, through the use of ANALYZE. When loading for the first time, there are no statistics, so for a large load it is advisable to load a sample, run ANALYZE, and then load the whole data.

Create the database without indexes (just the primary keys).

sdbconfig --format

Load a sample of the triples (say, 100 thousand to a million triples, until the load rate starts to drop appreciably). The sample must be representative of the data.

sdbload --time sample

Run ANALYZE on the database.

Empty the database if required. If your sample is one part of a large set of files, this step is not necessary at all; if you are loading one single large file, then you might wish to empty the database. Emptying is only needed if the data has bNodes in it, because the load process suppresses duplicates.


sdbconfig --truncate

Now load the data or rest of the data.

sdbload --time file

Add the indexes. This only takes a few minutes even on a very large store, but calculating them during loading (that is, --create, not --format) is noticeably slower.

sdbconfig --index

Run ANALYZE on the database again.

Tuning
It is essential to run the PostgreSQL ANALYZE command on a database, either during or after building. This is done via the command line psql or via pgAdmin. The PostgreSQL documentation describes ways to run this as a background daemon. Various PostgreSQL configuration parameters will affect performance, particularly effective_cache_size. The parameter enable_seqscan may help avoid some unexpectedly slow queries.

SDB/NotesMySQL

Contents

1 National Characters
2 Connection timeouts
3 Tuning

National Characters
SDB formats all table columns used for storing text in the MySQL schema to UTF-8. However, this does not cause the data to be transmitted in UTF-8 over the JDBC connection. The best way is to run the server with a default character set of UTF-8. This is set in the MySQL server configuration file:
[mysql]
default-character-set=utf8

A less reliable way is to pass parameters to the JDBC driver in the JDBC URL. The application will need to explicitly set the JDBC URL in the store configuration file.
...?useUnicode=true&characterEncoding=UTF-8

Connection timeouts
If connections time out after (by default) 8 hours of no activity, try setting autoReconnect=true in the JDBC URL.
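For example, a complete JDBC URL combining the parameters above might look like this (the host and database name are placeholders):

jdbc:mysql://localhost/SDB?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true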

Tuning
1. For InnoDB, the critical parameter is innodb_buffer_pool_size. See the MySQL sample configuration files for details.
2. Using ANALYZE TABLE on the database tables can improve the choices made by the MySQL optimizer.


SDB/Support
Variations in the setups of databases make it hard to diagnose problems by email without precise and detailed information. Questions should be sent to: <mailto:jena-dev@groups.yahoo.com>. Include a description of the database setup and the version numbers of all software, including the database. A complete, minimal example should be included. Complete means sufficient information to exactly recreate the situation; minimal means no more data than is needed to demonstrate it. Large data files are not minimal; it is very rare that a problem cannot be illustrated with fewer than 10 triples. Queries should be given as the parser sees them, not as Java strings. Use the command line tools to produce the complete, minimal example. If a query gives unexpected results, include some data, the query stripped to the non-performing part, and a description of the expected results. You should compare the results by running ARQ's reference query engine first:
arq.sparql --engine=ref --data=datafile --query=queryfile

SDB/Loading data
There are three ways to load data into SDB:

1. Use the command utility sdbload.
2. Use one of the Jena Model.read operations.
3. Use the Jena Model.add operations.

The last of these requires the application to signal the beginning and end of batches.

Loading with Model.read


A Jena Model obtained from SDB via:
SDBFactory.connectModel(store)

will automatically bulk load data for each call of one of the Model.read operations.

Loading with Model.add


The Model.add operations, in any form or combination of forms, whether adding a single statement, a list of statements, or another model, will invoke the bulk loader if the start of a bulk operation has been notified before the add operations. You can explicitly delimit bulk operations:
model.notifyEvent(GraphEvents.startRead)
... do add/remove operations ...
model.notifyEvent(GraphEvents.finishRead)

Failing to notify the end of the operations will result in data loss. A try/finally block can ensure that the finish is notified.
model.notifyEvent(GraphEvents.startRead) ;
try {
    ... do add/remove operations ...
} finally {
    model.notifyEvent(GraphEvents.finishRead) ;
}

The Model.read operations do this automatically. The bulk loader will automatically chunk large sequences of additions into sizes appropriate to the underlying database. The bulk loader is threaded, with double buffering: loading into the database happens in parallel with the application thread and any RDF parsing.

How the loader works


Loading consists of two phases: in the Java VM, and on the database itself. The SDB loader takes incoming triples and breaks them down into components ready for the database. These prepared triples are added to a queue for the database phase, which (by default) takes place on a separate thread. When the number of triples reaches a limit (default 20,000), or the finish of the update is signaled, the triples are passed to the database. You can configure whether to use threading, and the 'chunk size' -- the number of triples per load event -- via StoreLoader.
Store store ;   // SDB Store
...
store.getLoader().setChunkSize(5000) ;
// store.getLoader().setUseThreading(false) ;   // Don't thread

You should set these before the loader has been used. Each loader sets up two temporary tables (NNode and NTrip) that mirror the Nodes and Triples tables. These tables are virtually identical, except that a) they are not indexed and b) for the index variant there is no index column for nodes. When loading, prepared triples -- triples that have been broken down ready for the database -- are passed to the loader core (normally running on a different thread). When the chunk size is reached, or we are out of triples, the following happens:

Prepared nodes are added in one go to NNode. Duplicate nodes within a chunk are suppressed on the Java side (this is worth doing since they are quite common, e.g. properties).
Prepared triples are added in one go to NTrip.
New nodes are added to the node table (duplicate suppression is explained below).
New triples are added to the triple table (once again suppressing duplicates). For the index case this involves joining on the node table to do a hash-to-index lookup.
We commit. If anything goes wrong, the transaction (the chunk) is rolled back, and an exception is thrown (or readied for throwing on the calling thread).

Thus there are five calls to the database for every chunk. The database handles almost all of the work uninterrupted (duplicate suppression, hash to index lookup), which makes loading reasonably quick.

Duplicate Suppression

MySQL has a very useful INSERT IGNORE, which will keep going, skipping an offending row if a uniqueness constraint is violated. For other databases we need something else. Having tried a number of options, the best seems to be to INSERT new items by LEFT JOINing the new items to the existing items, then filtering with WHERE (existing item feature) IS NULL. Specifically, for the triple hash case (where no id lookups are needed):
INSERT INTO Triples
SELECT DISTINCT NTrip.s, NTrip.p, NTrip.o
       -- DISTINCT because the new triples may contain duplicates (not so for nodes)
FROM NTrip LEFT JOIN Triples
     ON (NTrip.s=Triples.s AND NTrip.p=Triples.p AND NTrip.o=Triples.o)
WHERE Triples.s IS NULL OR Triples.p IS NULL OR Triples.o IS NULL

SDB/Loading performance

Contents

1 Introduction
2 The Databases and Hardware
  2.1 Hardware
  2.2 Windows setup
  2.3 Linux setup
3 The Dataset and Queries
  3.1 LUBM
  3.2 dbpedia
4 Loading
5 Results
6 Uniprot 700m loading: Tuning Helps

Introduction
Performance reporting is an area prone to misinterpretation, and such reports should be liberally decorated with disclaimers. In our case there are an alarming number of variables: the hardware, the operating system, the database engine and its myriad parameters, the data itself, the queries, and planetary alignment. Given this, here is some basic information; you may find it sufficient:

Loading speed will be in the thousands of triples per second range. Expect to load around 5 million triples per hour.
Index layout is usually better than hash for loading speed. Hash loading is very bad on MySQL.
Hash layout is better for query speed.

We suggest that you don't choose your database based on these figures. The performance is broadly similar, so if you already have a relational database installed, that is your best option.

The Databases and Hardware


SDB supports a range of databases, but the figures here are limited to SQLServer and PostgreSQL. The hardware used was identical, although running Linux (for PostgreSQL) and Windows (for SQLServer).

Hardware

Dual AMD Opteron processors, 64 bit, 1.8 GHz
8 GB memory
80 GB disk for database

Windows setup

Windows Server 2003
Java 6, 64 bit
SQLServer 2005

Linux setup

Redhat Enterprise Linux 4
Java 6, 64 bit
PostgreSQL 8.2

The Dataset and Queries


We use the Lehigh University Benchmark http://swat.cse.lehigh.edu/projects/lubm/ and dbpedia http://dbpedia.org/, together with some example queries that each provides. You can find the queries in SDB/PerfTests.

LUBM

LUBM generates artificial datasets. To be useful, one needs to apply reasoning, and this was done in advance of loading. The queries are quite stressful for SDB in that they are not very ground (in many, neither subjects nor objects are present), and many produce very large result sets. Thus they are probably atypical of many SPARQL queries.

Size: 19 million triples (including inferred triples).

dbpedia

The dbpedia queries are, unlike LUBM, quite ground. dbpedia contains many large literals, in contrast to LUBM.

Size: 25 million triples.

Loading
All operations were performed using SDB's command line tools. The data was loaded into a freshly formatted SDB store -- although PostgreSQL needs an ANALYZE to avoid silly planning -- and then the additional indexes were added.

Results
Database                    Loading speed (tps)   Index time (s)   Size (MB)
LUBM Postgres (Hash)        4972                  199              5124
LUBM Postgres (Index)       8658                  176              3666
LUBM SQLServer (Hash)       8762                  121              3200
LUBM SQLServer (Index)      7419                  68               2029
DBpedia Postgres (Hash)     3029                  298              10193
DBpedia Postgres (Index)    4293                  227              6251
DBpedia SQLServer (Hash)    5345                  162              6349
DBpedia SQLServer (Index)   4749                  110              4930

Uniprot 700m loading: Tuning Helps


To illustrate the variability in loading speed, and to emphasise the importance of tuning, consider the case of Uniprot http://dev.isb-sib.ch/projects/uniprot-rdf/. Uniprot contains (at the time of writing) around 700 million triples. We loaded these onto the SQLServer setup given above, but with the following changes:

The database was stored on a separate disk. The database's transactional logs were stored on yet another disk.

So the RDF data, database data, and log data were all on distinct disks. Loading into an index-layout store proceeded at:

11079 triples per second

SDB/Query
SDB supports various layouts, but the overall process of compiling a query is common to all of them. Each layout provides the concrete, basic steps for mapping a query onto the details of that database layout.
sdbquery

can be used to print out the SQL that would be used for a query.

SDB/Query performance
This page compares the effect of SDB with RDB, Jena's usual database layout. RDB was designed to support fine-grained API calls as well as having some support for basic graph patterns, so the RDB design goals were not those of SDB. RDB uses a denormalised database layout so that statement-level operations do not require additional joins. The SDB layout is normalised so that the triple table is narrower and uses integers for RDF nodes, then does joins to get the node representations. This optimizes for longer patterns, not API operations. These figures were taken in July 2007. As with any performance figures, these should be taken merely as a guide. The shape of the data, the hardware details, the choice of database and its configuration (particularly the amount of memory used), as well as the queries themselves, all greatly contribute to the execution costs.

Contents

1 Setup
2 LUBM Query 1
3 LUBM Query 2
4 Summary

Setup
Database and hardware setup was the same as for the load performance tests. Data was generated with the LUBM test generator (with N = 15), then expanded by inference on loading to give about 19.5 million triples. This data is larger than the database could completely cache. The queries are taken from the LUBM suite and rewritten in SPARQL.

LUBM Query 1
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub:  <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT * WHERE
{ ?x rdf:type ub:GraduateStudent .
  ?x ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0> .
}

Jena: 24.16s
SDB/index: 0.014s
SDB/hash: 0.04s

LUBM Query 2
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub:  <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT * WHERE
{ ?x rdf:type ub:GraduateStudent .
  ?y rdf:type ub:University .
  ?z rdf:type ub:Department .
  ?x ub:memberOf ?z .
  ?z ub:subOrganizationOf ?y .
  ?x ub:undergraduateDegreeFrom ?y .
}

This query searches for a particular pattern in the data without a specific starting point.

Jena: 232.1s (153s with an additional index on OP)
SDB/index: 12.7s
SDB/hash: 3.7s

Notes: Removing the rdf:type statements actually slows the query down.

Summary
In SPARQL queries, there is often a sufficiently complex graph pattern that the SDB design tradeoff provides significant advantages in query performance.
