TDB is a component of Jena for RDF storage and query. It supports the full range of Jena APIs. A TDB store can be accessed and managed with the provided command line scripts and via the Jena API.
Status
TDB can be used as a high performance, non-transactional, RDF store on a single machine. Documentation on the wiki describes the latest version unless otherwise noted.
Downloads
TDB is distributed from the Jena project area on SourceForge (see the Jena download area). It is also available for Maven and Apache Ivy from the Jena repository and the Jena development repository.
Documentation
- TDB Installation
- TDB Requirements
- Command line utilities
- Use from Java
- Use on 64 bit or 32 bit Java systems
- Datasets and Named Graphs
- Assemblers for Graphs and Datasets
- Dynamic Datasets: Query a subset of the named graphs
- Quad filtering: Hide information in the dataset
- Value Canonicalization
- TDB Design
- The TDB Optimizer
- TDB Configuration
- Joseki Integration
Subversion
TDB Subversion repository on SourceForge.
Support
Email to jena-dev@groups.yahoo.com.
TDB/Installation
TDB requires Java 6 for versions up to 0.8.1 and Java 5 for versions afterwards. TDB is distributed as a complete download and also via Apache Maven (groupId com.hp.hpl.jena, artifactId tdb). The Jena repositories are:
- http://openjena.org/repo (mirrored to the central Maven repository)
- http://openjena.org/repo-dev for development builds
After downloading a TDB distribution, place all the jars in the lib/ directory on the classpath. TDB itself is a single jar; the other jars in lib/ are a consistent snapshot of its dependencies at the time of release. The TDB build system uses Maven, but there is also an Ant build, run with "ant jar".
TDB/Requirements
TDB up to version 0.8.1 requires Java 6; versions 0.8.2 onwards require Java 5. TDB can run on 32 bit or 64 bit JVMs; it adapts to the underlying architecture by choosing different file access mechanisms. 64 bit Java is preferred for large scale and production deployments. On 64 bit Java, TDB uses memory mapped files; on 32 bit platforms, TDB uses in-heap caching of file data. In practice, the JVM heap size should be set to at least 1Gbyte. While there are no inherent scaling limits on the size of the database, in practice only one large dataset can be handled per TDB instance. The on-disk file format is compatible between 32 and 64 bit systems, and databases can be transferred between systems by file copy if the databases are not in use (no TDB instance is accessing them at the time). Databases cannot be copied while TDB is running, even if TDB is not actively processing a query or update.
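An application can inspect the JVM data model in the same spirit as TDB's architecture adaptation. This is a sketch only: the sun.arch.data.model property is HotSpot-specific, and TDB's own detection logic may differ.

```java
public class ArchCheck {
    public static void main(String[] args) {
        // "sun.arch.data.model" is "32" or "64" on HotSpot JVMs;
        // fall back to the standard os.arch property elsewhere.
        String bits = System.getProperty("sun.arch.data.model",
                                         System.getProperty("os.arch"));
        System.out.println("JVM data model: " + bits);
    }
}
```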
TDB/Commands
Scripts

The directory bin/ contains shell scripts to run the commands from the command line. They are bash scripts, and work on Linux and on Cygwin for MS Windows.

Script set up

Set the environment variable TDBROOT to the root of the TDB installation, then put the scripts on the shell command path:
$ PATH=$TDBROOT/bin:$PATH
Alternatively, there are wrapper scripts in $TDBROOT/bin2 which can be placed in a convenient directory that is already on the shell command path.

Argument Structure

Each command has command-specific arguments, described below. All commands support --help to give details of named and positional arguments. There are two equivalent forms of named argument syntax:
--arg=val --arg val
Setting options from the command line

TDB has a number of configuration options which can be set from the command line using:
--set tdb:symbol=value
Using tdb: is shorthand for the URI prefix http://jena.hpl.hp.com/TDB#, so the full URI form is
--set http://jena.hpl.hp.com/TDB#symbol=value
TDB Commands
Store description

TDB commands use an assembler description for the persistent store:
--desc=assembler.ttl --tdb=assembler.ttl
or a direct reference to the directory with the index and node files:
--loc=DIRECTORY --location=DIRECTORY
The assembler description follows the form for a dataset given on the TDB assembler description page. If neither an assembler file nor a location is given, --desc=tdb.ttl is assumed.
tdbloader
Bulk loader and index builder. Performs bulk load operations more efficiently than simply reading RDF into a TDB-backed model.
tdbquery
Invoke a SPARQL query on a store. Use --time for timing information. The store is attached on each run of this command, so timing includes some overhead not present in a running system. Details about query execution can be obtained -- see the notes on the TDB Optimizer.
tdbdump
(Version 0.8.5) Dump the contents of the store.
tdbstats
Produce a statistics file for the dataset. See the TDB Optimizer description.
TDB/JavaAPI
TDB supports all the operations of the Jena API, including SPARQL query. The application obtains a model or RDF dataset from TDB and then uses it as it would any other model or dataset. See also Concurrency and Locking.
Constructing a model or dataset

Using a directory name, a dataset is created with the TDBFactory and closed when no longer needed:

Dataset dataset = TDBFactory.createDataset("DB") ;
...
dataset.close() ;
Bulkloader
The bulkloader is a faster way to load data into an empty graph than just using the Jena update operations. It is accessed through the command line utility tdbloader.
In addition, while TDB does not support full transaction semantics, Model.commit does provide for synchronising the in-memory and on-disk states.
model.commit() ;
TDB provides an explicit call for model and dataset objects for synchronization with disk:
Model model = ... ;
TDB.sync(model) ;

Dataset dataset = ... ;
TDB.sync(dataset) ;
Any dataset or model can be passed to these functions - if they are not backed by TDB then no action is taken and the call merely returns without error.
Concurrency
TDB provides a Multiple Reader or Single Writer (MRSW) policy for concurrent access. Applications are expected to adhere to this policy; it is not automatically checked. One gotcha is Java iterators: an iterator that is moving over the database is making read operations, so no updates to the dataset are possible while the iterator is in use.
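TDB does not enforce MRSW itself, so a multi-threaded application must impose the discipline. A minimal sketch using java.util.concurrent is shown below; the guarded long stands in for shared dataset state, and none of this is TDB API.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative only: an application-level guard that allows many
// concurrent readers but gives writers exclusive access (MRSW).
public class MrswGuard {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private long value = 0;            // stands in for shared dataset state

    public long read() {               // many readers may hold this at once
        lock.readLock().lock();
        try { return value; }
        finally { lock.readLock().unlock(); }
    }

    public void write(long v) {        // writers are exclusive
        lock.writeLock().lock();
        try { value = v; }
        finally { lock.writeLock().unlock(); }
    }
}
```

An iterator-based read would hold the read lock for the whole iteration, which is exactly why updates must wait until iteration finishes.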
TDB/JVM-64-32
TDB runs on both 32-bit and 64-bit Java Virtual Machines. The same file formats are used on both systems, and database files can be transferred between architectures (no TDB system should be running for the database at the time of the copy). What differs is the file access mechanism used.
The file access mechanism can be set explicitly, but this is not a good idea for production usage, only for experimentation - see the File Access mode option.
64-bit Java
On 64-bit Java, TDB uses memory mapped files and the operating system handles much of the caching between RAM and disk. The amount of RAM used for file caching grows and shrinks as other applications run on the machine; the fewer other programs running, the more RAM is available for file caching. TDB is faster on a 64 bit JVM because more memory is available for file caching.
32-bit Java
On 32-bit Java, TDB uses its own file caching to enable large databases. 32-bit Java limits the address space of the JVM to about 1.5Gbytes (the exact size is JVM-dependent), and this includes memory mapped files, even though they are not in the Java heap. The JVM heap size may need to be increased to make space for the disk caches used by TDB.
TDB/Datasets
An RDF dataset is a collection of one unnamed default graph and zero or more named graphs. In a SPARQL query, a query pattern is matched against the default graph unless the GRAPH keyword is applied to a pattern.
Dataset Storage
One file location (directory) is used to store one RDF dataset. The unnamed graph of the dataset is held as a single graph, while all the named graphs are held in a collection of quad indexes. Every dataset obtained via TDBFactory.createDataset(Location) for the same location within a JVM is the same dataset. If a model is obtained via TDBFactory.createModel(Location), there is a hidden, shared dataset and the appropriate model is returned.
Dataset Query
(TDB version 0.7.0 and later) There is full support for SPARQL query over named graphs in a TDB-backed dataset. All the named graphs can be treated as a single graph which is the union (RDF merge) of all the named graphs. This union is given the special graph name <urn:x-arq:UnionGraph> in a GRAPH pattern. When querying the RDF merge of named graphs, the default graph in the store is not included. This feature applies to queries only; it does not affect the storage, nor does it change loading. Alternatively, if the symbol tdb:unionDefaultGraph (see TDB Configuration) is set, the unnamed graph for the query is the union of all the named graphs in the dataset. The stored default graph is then ignored and is not part of the data of the union graph, although it is accessible by the special name <urn:x-arq:DefaultGraph> in a GRAPH pattern.

Special graph names:

urn:x-arq:UnionGraph - the RDF merge of all the named graphs in the dataset of the query.
urn:x-arq:DefaultGraph - the default graph of the dataset, used when the default graph of the query is the union graph.
Note that setting tdb:unionDefaultGraph does not affect the default graph or default model obtained with dataset.getDefaultModel(). The RDF merge of all named graphs can be accessed as the named graph urn:x-arq:UnionGraph using Dataset.getNamedModel("urn:x-arq:UnionGraph").
TDB/Assembler
Assemblers are a general mechanism in Jena to describe objects to be built; often these objects are models and datasets. Assemblers are used heavily in Joseki for dataset and model descriptions, for example. SPARQL queries operate over an RDF dataset, which is an unnamed default graph and zero or more named graphs. Having the description in a file means that the data that the application is going to work on can be changed without changing the program code.
Dataset
This is needed for use in Joseki. A dataset can be constructed in an assembler file:
@prefix tdb:  <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:   <http://jena.hpl.hp.com/2005/11/Assembler#> .

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .

<#dataset> rdf:type tdb:DatasetTDB ;
           tdb:location "DB" ;
           .
Only one dataset can be stored in a location (filing system directory). The first section declares the prefixes used later:
@prefix tdb:  <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:   <http://jena.hpl.hp.com/2005/11/Assembler#> .
then there is a statement that causes TDB to be loaded. TDB initialization occurs automatically when loaded. The TDB jar must be on the Java classpath.
While order in this file does not matter to the machine (the Jena assembler system checks for any ja:loadClass statements before any attempt to assemble an object is made), having it early in the file is helpful to anyone reading the file.
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
The property tdb:location gives the file name as a string. It is relative to the application's current working directory, not to where the assembler file is read from. The dataset description is usually found by looking for the one subject with type tdb:DatasetTDB. If more than one description is given in a single file, the application will have to specify which one it wishes to use.
Graph
A single graph can be described as well:
@prefix tdb:  <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:   <http://jena.hpl.hp.com/2005/11/Assembler#> .

<#graph> rdf:type tdb:GraphTDB ;
         tdb:location "DB" ;
         .
but note that this graph is a single graph at that location; it is the default graph of a dataset. A particular named graph in the dataset at a location can be assembled with:
@prefix tdb:  <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:   <http://jena.hpl.hp.com/2005/11/Assembler#> .

<#graphNamed> rdf:type tdb:GraphTDB ;
              tdb:location "DB" ;
              tdb:graphName <http://example.org/name1> ;
              .
It is also possible to describe a graph, or named graph, in a dataset where the dataset description can now be shared:
<#graph2> rdf:type tdb:GraphTDB ;
          tdb:dataset <#dataset> ;
          .

<#dataset> rdf:type tdb:DatasetTDB ;
           tdb:location "DB" ;
           .
Mixed Datasets
It is possible to create a dataset with graphs backed by different storage subsystems, although query is not necessarily as efficient. To include a TDB-backed graph as a named graph in a dataset, use vocabulary as shown below:
@prefix tdb:  <http://jena.hpl.hp.com/2008/tdb#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ja:   <http://jena.hpl.hp.com/2005/11/Assembler#> .

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
tdb:GraphTDB   rdfs:subClassOf ja:Model .

# A dataset of one TDB-backed graph as the default graph
# and an in-memory graph as a named graph.
<#dataset> rdf:type ja:RDFDataset ;
           ja:defaultGraph <#graph> ;
           ja:namedGraph
               [ ja:graphName <http://example.org/name1> ;
                 ja:graph <#graph2> ] ;
           .

<#graph> rdf:type tdb:GraphTDB ;
         tdb:location "DB" ;
         .

<#graph2> rdf:type ja:MemoryModel ;
          ja:content [ ja:externalContent <file:Data/books.n3> ] ;
          .
which provides for integration with complex model setups, such as reasoners.
TDB/DynamicDatasets
(TDB version 0.8.5 and later) This feature allows a query to be made on a subset of all the named graphs in the TDB data store. The SPARQL GRAPH pattern allows access either to a specific named graph or to all the named graphs in a dataset; this feature means that only specified named graphs are visible to the query. SPARQL has the concept of a dataset description: in a query string, the FROM and FROM NAMED clauses specify the dataset. The FROM clauses define the graphs that are merged to form the default graph, and the FROM NAMED clauses identify the graphs to be included as named graphs. Normally, ARQ interprets these as coming from the web, that is, the graphs are read using HTTP GET. TDB modifies this behavior: instead of the universe of graphs being the web, the universe of graphs is the TDB data store. FROM and FROM NAMED describe a dataset with graphs drawn only from the TDB data store.
Using one or more FROM clauses, with no FROM NAMED in the query, leaves the named graphs as all the named graphs in the data store. Using one or more FROM NAMED clauses, with no FROM in the query, causes an empty default graph to be used. If the symbol TDB.symUnionDefaultGraph is also set, then the default graph is the union of all the named graphs (FROM NAMED) and the graphs already used for the default graph via FROM. urn:x-arq:UnionGraph and urn:x-arq:DefaultGraph explicitly name the union of named graphs (FROM NAMED) and the described default graph (the union of FROM) directly.
# Follow a foaf:knows path across both Alice's and Bob's FOAF data,
# where the data is in the data store as named graphs.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
BASE   <http://example>
SELECT ?zName
FROM <alice-foaf>
FROM <bob-foaf>
WHERE {
    ?alice foaf:knows ?z .
    ?z     foaf:name  ?zName .
}
TDB/QuadFilter
(TDB version 0.8.7 and later) This page describes how to filter quads at the lowest level of TDB. It can be used to hide certain quads (triples in named graphs) or triples. The code for the example on this page can be found in the TDB download: srcexamples/tdb.examples/ExQuadFilter.java. Filtering quads should be used with care; the performance of the tuple filter callback is critical. See also Dynamic Datasets to select only certain specified named graphs for a query. TDB will call a registered filter on every tuple that it retrieves from any of the indexes, both quads (for named graphs) and triples (for the stored default graph). This filter indicates whether to accept or reject the quad or triple. This happens during basic graph pattern processing. A rejected quad is simply not processed further in the basic graph pattern, and it is as if it is not in the dataset. The filter has the signature:
// org.openjena.atlas.iterator.Filter
interface Filter<T>
{
    public boolean accept(T item) ;
}
with a type parameter of Tuple<NodeId>. NodeId is the low level internal identifier TDB uses for RDF terms. Tuple is a class for immutable tuples of values of the same type.
/** Create a filter to exclude the graph http://example/g2 */
private static Filter<Tuple<NodeId>> createFilter(Dataset ds)
{
    DatasetGraphTDB dsg = (DatasetGraphTDB)(ds.asDatasetGraph()) ;
    NodeTable nodeTable = dsg.getQuadTable().getNodeTupleTable().getNodeTable() ;
    // Filtering operates at a very low level:
    // need to know the internal identifier for the graph name.
    final NodeId target = nodeTable.getNodeIdForNode(Node.createURI("http://example/g2")) ;

    // Filter to accept/reject a quad as being visible.
    Filter<Tuple<NodeId>> filter = new Filter<Tuple<NodeId>>() {
        public boolean accept(Tuple<NodeId> item)
        {
            // Quads are 4-tuples, triples are 3-tuples.
            if ( item.size() == 4 && item.get(0).equals(target) )
                return false ;      // Reject
            return true ;           // Accept
        }
    } ;
    return filter ;
}
To install a filter, put it in the context of a query execution under the symbol SystemTDB.symTupleFilter
Dataset ds = ... ;
Filter<Tuple<NodeId>> filter = createFilter(ds) ;
Query query = ... ;
QueryExecution qExec = QueryExecutionFactory.create(query, ds) ;
qExec.getContext().set(SystemTDB.symTupleFilter, filter) ;
TDB/ValueCanonicalization
TDB canonicalizes certain XSD datatypes: the value of literals of these datatypes is stored, not the original lexical form. For example, "01"^^xsd:integer, "1"^^xsd:integer and "+001"^^xsd:integer are all the same value and are stored as the same RDF literal. In addition, derived types for integers are also understood by TDB; for example, "01"^^xsd:integer and "1"^^xsd:byte are the same value. When RDF terms for these values are returned, the lexical form will be the canonical representation. Only certain ranges of values are directly encoded as values. If a literal is outside the canonicalization range, its lexical representation is stored. TDB transparently switches between value and non-value based literals in graph matching and filter expressions; non-canonicalized and canonicalized values will be compared as needed. (Future versions of TDB may increase the ranges canonicalized.) The datatypes canonicalized by TDB are:
- XSD decimal (canonicalized range: 8 bits of scale, signed 48 bits of value)
- XSD integer (canonicalized range: 56 bits)
- XSD dateTime (canonicalized range: year zero to the year 8000, millisecond accuracy, timezone to 15 minutes)
- XSD date (canonicalized range: year zero to the year 8000, timezone to 15 minutes)
- XSD boolean (canonicalized range: true and false)
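The collapse of different lexical forms to a single stored value can be illustrated with a plain Java arithmetic type. This is an analogy only, not TDB code: TDB stores values in its own binary encoding, but the lexical-to-value mapping behaves the same way.

```java
import java.math.BigInteger;

// "01", "1" and "+001" all denote the same xsd:integer value, so TDB
// stores one literal; parsing to a value shows the same collapse, and
// toString() regenerates the canonical lexical form.
public class Canonical {
    static BigInteger value(String lexicalForm) {
        return new BigInteger(lexicalForm);
    }

    public static void main(String[] args) {
        System.out.println(value("01").equals(value("+001")));  // true: one value
        System.out.println(value("+001").toString());           // canonical form "1"
    }
}
```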
TDB/Architecture
This page gives an overview of the TDB architecture. Specific details refer to TDB 0.8.
Terminology
Terms like "table" and "index" are used in this description. They don't directly correspond to concepts in SQL. For example, in SQL terms there is no triple table; that can be seen as just having indexes for the table or, alternatively, as 3 tables, each of which has a primary key, with TDB managing the relationship between them.
Design
A dataset backed by TDB is stored in a single directory in the filing system. A dataset consists of:

- the node table
- triple and quad indexes
- the prefixes table
The Node Table

The node table provides two mappings: from Node to NodeId and from NodeId to Node. NodeIds are 8 byte quantities. The Node to NodeId mapping is based on a hash of the Node (a 128 bit MD5 hash - the length was found not to be a major performance factor). The default storage of the node table is a sequential access file for the NodeId to Node mapping and a B+Tree for the Node to NodeId mapping.
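The hashing step can be sketched with the standard MessageDigest API. This is illustrative only: the exact byte representation of a node that TDB feeds into MD5 is an assumption here, and the node string used is made up.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class NodeHash {
    // Hash the string form of an RDF term, in the spirit of the node
    // table's Node->NodeId index (not TDB's exact input format).
    static byte[] hash(String nodeStr) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return md5.digest(nodeStr.getBytes(StandardCharsets.UTF_8));
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new RuntimeException(e);   // MD5 is always available
        }
    }

    public static void main(String[] args) {
        byte[] h = hash("<http://example/s>");
        System.out.println(h.length * 8);    // 128 (bits)
    }
}
```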
Prefixes Table
The prefixes table uses a node table and an index for GPU (Graph->Prefix->URI). It is usually small. It does not take part in query processing. It provides support for Jena's PrefixMappings, used mainly for presentation and for serialisation of triples in RDF/XML or Turtle.
TDB B+Trees
Many of the persistent data structures in TDB use a custom implementation of threaded B+Trees. The TDB implementation only provides for fixed length keys and fixed length values; there is no use of the value part in triple indexes. The threaded nature means that long scans of indexes proceed without needing to traverse the branches of the tree.
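The benefit of threaded leaves can be sketched with a toy structure. This is illustrative only, not TDB's B+Tree: leaf blocks are chained together, so a long scan walks the chain directly instead of revisiting internal branch nodes.

```java
import java.util.ArrayList;
import java.util.List;

public class LeafChain {
    // A leaf block of sorted keys, threaded to the next leaf.
    static class Leaf {
        long[] keys;
        Leaf next;
        Leaf(long[] keys) { this.keys = keys; }
    }

    // Scan every key from the given leaf onwards, following the thread;
    // no branch nodes are touched after the starting leaf is found.
    static List<Long> scan(Leaf start) {
        List<Long> out = new ArrayList<>();
        for (Leaf l = start; l != null; l = l.next)
            for (long k : l.keys)
                out.add(k);
        return out;
    }

    public static void main(String[] args) {
        Leaf a = new Leaf(new long[]{1, 2});
        Leaf b = new Leaf(new long[]{3, 4});
        a.next = b;                        // the "thread"
        System.out.println(scan(a));       // [1, 2, 3, 4]
    }
}
```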
Inline values
Values of certain datatypes are held as part of the NodeId in the bottom 56 bits. The top 8 bits indicate the type: whether the NodeId is a reference to an external node in the node table, or which value space an inline value belongs to. The value spaces handled are (TDB 0.8): xsd:decimal, xsd:integer, xsd:dateTime, xsd:date and xsd:boolean. Each has its own encoding to fit in 56 bits. If a node falls outside of the range of values that can be represented in the 56 bit encoding, it is stored in the node table instead. The xsd:dateTime and xsd:date ranges cover about 8000 years from year zero, with a precision down to 1 millisecond. Timezone information is retained to an accuracy of 15 minutes, with special timezones for Z and for no explicit timezone. By storing the value, the exact lexical form is not recorded: the integers 01 and 1 will both be treated as the value 1. Derived XSD datatypes are held as their base type. The exact datatype is not retained; the value of the RDF term is.
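The split into an 8-bit tag and a 56-bit payload can be sketched with plain bit operations. The tag values and layout here are made up for illustration; TDB's actual encoding differs.

```java
public class InlineId {
    static final long VALUE_MASK = (1L << 56) - 1;

    // Pack an 8-bit type tag (top bits) and a 56-bit value (bottom bits)
    // into one 64-bit id, mirroring the inline-value idea.
    static long pack(int typeTag, long value) {
        return ((long) typeTag << 56) | (value & VALUE_MASK);
    }

    static int typeTag(long id) { return (int) (id >>> 56); }
    static long value(long id)  { return id & VALUE_MASK; }

    public static void main(String[] args) {
        long id = pack(3, 12345L);                       // tag 3 is arbitrary
        System.out.println(typeTag(id) + " " + value(id)); // 3 12345
    }
}
```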
Query Processing
TDB uses the OpExecutor extension point of ARQ. TDB provides low level optimization of basic graph patterns using a statistics based optimizer.
TDB/Optimizer
Query execution in TDB involves both static and dynamic optimizations. Static optimizations are transformations of the SPARQL algebra performed before query execution begins; dynamic optimizations involve deciding the best execution approach during the execution phase and can take into account the actual data so far retrieved. The optimizer has a number of strategies: a statistics based strategy, a fixed strategy and a strategy of no reordering. For the preferred statistics strategy, the TDB optimizer uses information captured in a per-database statistics file. The file takes the form of a number of rules for approximate matching counts for triple patterns. The statistics file can be generated automatically. The user can add and modify rules to tune the database based on higher level knowledge, such as inverse function properties.
Feedback on the effects, good and bad, of the TDB optimizer would be appreciated. Mail to jena-dev.
Quickstart
This section provides a practical how-to.

1. Load data.
2. Generate the statistics file: run tdbconfig stats.
3. Place the file generated in the database directory with the name stats.opt.
The optimizer strategy is chosen by the presence of a file in the database directory:

none.opt - no reordering: execute triple patterns in the order given in the query.
fixed.opt - use a built-in reordering based on the number of variables in a triple pattern.
stats.opt - the contents of this file are the weighting rules (see below).
The contents of the files none.opt and fixed.opt are not read and don't matter. They can be zero-length files.
If more than one file is found, the choice is made in the order: stats.opt over fixed.opt over none.opt. The "no reorder" strategy can be useful in investigating the effects of optimization. Filter placement still takes place.
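Since none.opt and fixed.opt act only as markers, a strategy can be selected by hand. This assumes the database directory is DB:

```shell
# Select the fixed-reordering strategy for the database in DB/.
# The file contents are not read; a zero-length file is enough.
mkdir -p DB
touch DB/fixed.opt
```

Removing the marker file (and any stats.opt) reverts the choice.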
Filter placement
One of the key optimizations is of filtered basic graph patterns. This optimization decides the best order of triple patterns in a basic graph pattern and also the best point at which to apply the filters within the triple patterns. Any filter expression of a basic graph pattern is placed immediately after the point at which all its variables will be bound. Conjunctions at the top level in filter expressions are broken into their constituent pieces and placed separately.
Investigating what is going on

TDB can log details of query execution. The context setting is for key (Java constant) TDB.symLogExec. To set globally:
TDB.getContext().set(TDB.symLogExec,true) ;
and it may also be set on an individual query execution using its local context:
QueryExecution qExec = QueryExecutionFactory.create(...) ;
qExec.getContext().set(TDB.symLogExec, true) ;
TDB version 0.8.3 provides more fine-grained logging controls. Instead of "true", which sets all levels, the following explanation levels can be used:

INFO - log each query.
FINE - log each query and its algebra form after optimization.
ALL - log query, algebra and every database access (can be expensive).
NONE - no information logged.
These can be specified as strings, to the command line tools, or using the constants in Explain.InfoLevel.
qExec.getContext().set(TDB.symLogExec,Explain.InfoLevel.FINE) ;
Statistics Rule File

The statistics file contains a meta block and a number of pattern rules. A simple example:
(prefix ((: <http://example/>))
  (stats
    (meta
      (timestamp "2008-10-23T10:35:19.122+01:00"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
      (run@ "2008/10/23 10:35:19")
      (count 11))
    (:p 7)
    (<http://example/q> 7)
  ))
This example statistics file contains some metadata about the statistics (the time and date the file was generated, and the size of the graph), and the frequency counts for two predicates: http://example/p (written using a prefixed name) and http://example/q (written in full). The numbers are estimated counts. They do not have to be exact; they guide the optimizer in choosing one execution plan over another. They do not have to be exactly up to date, provided the relative counts are representative of the data.

Statistics Rule Language

A rule is made up of a triple pattern and a count estimate for the approximate number of matches that the pattern will yield. This does not have to be exact, only an indication. In addition, the optimizer considers which variables will be bound to RDF terms by the time a triple pattern is reached in the execution plan being considered. For example, in the basic graph pattern:
{ ?x :identifier 1234 .
  ?x :name ?name . }
?x will be bound to an RDF term in the pattern ?x :name ?name if it is executed after the pattern ?x :identifier 1234. A rule is of the form:
( (subj pred obj) count)
where subj, pred, obj are either RDF terms or one of the tokens in the following table:
TERM - matches any RDF term (URI, literal, blank node).
VAR - matches a named variable (e.g. ?x).
URI - matches a URI.
LITERAL - matches an RDF literal.
BNODE - matches an RDF blank node (in the data).
ANY - matches anything - a term or variable.
From the example above, (VAR :identifier TERM) will match ?x :identifier 1234.
(TERM :name VAR) will match ?x :name ?name in a potential plan where the :identifier pattern is executed first, because ?x will be a bound term at that point, but not if this triple pattern is considered first.
When searching for a weighting of a triple pattern, the first rule to match is taken. The rule which says an RDF graph is a set of triples:
((TERM TERM TERM) 1)
is always implicitly present.

BNODE does not match a blank node in the query (a blank node in a query acts as a variable and matches VAR); it matches in the data, where it is known that the slot of a triple pattern is a blank node.
where for small graphs (less than 100 triples) X=2 and Y=4, but Y=40 if the predicate is rdf:type, and 2, 10, 1000 for large graphs. Use of "VAR rdf:type Class" can be a quite unselective triple pattern, so there is a preference to move it later in the order of execution to allow more selective patterns to reduce the set of possibilities first. The astute reader may notice that ontological information may render it unnecessary (the domain or range of another property implies the class of some resource); TDB does not currently perform this optimization. These numbers are merely convenient guesses, and the application can use the full rules language for detailed control of pattern weightings.
is used when no other rule (abbreviated or full) matches a triple pattern that has a URI in the predicate position. If a rule of this form is absent, the default is to place the triple pattern after all known triple patterns; this is the same as specifying -1 as the number. To declare that the rules are complete and no other predicates occur in the data, set this to 0 (zero), because the triple pattern can not then match the data (the predicate does not occur).
Writing Rules
Rule for an inverse functional property:
((VAR :ifp TERM) 1 )
and even if a property is only approximately identifying for resources (e.g. date of birth in a small dataset of people), it is useful to indicate this. Because the counts needed are only approximations to let the optimizer choose one order over another, and need not be exact, rules that are usually right but may be slightly wrong are still useful overall. Rules involving rdf:type can be useful where they indicate whether a particular class is common or not. In some datasets
((VAR rdf:type class) ...)
may help little, because a property whose domain is that class, or a subclass, may be more selective. So a rule like:
((VAR :property VAR) ...)
is more selective. In other datasets there may be many classes, each with a small number of instances, in which case a rule like

((VAR rdf:type class) ...)

can itself be useful.
TDB/Joseki Integration
Joseki uses an RDF dataset description. Using a TDB graph in a Joseki server instance is a matter of putting the graph in a dataset, as in the example below, where it is the default graph of the dataset. Full assembler details are on the TDB assembler description page. A simple example that publishes one TDB-backed dataset at a SPARQL endpoint:
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB rdfs:subClassOf ja:RDFDataset .
tdb:GraphTDB   rdfs:subClassOf ja:Model .

<#dataset> rdf:type tdb:DatasetTDB ;
           tdb:location "DB" ;
           .
SDB
SDB is a component of Jena for RDF storage and query, specifically to support SPARQL. The storage is provided by an SQL database, and many databases are supported, both open source and proprietary. An SDB store can be accessed and managed with the provided command line scripts and via the Jena API.
Downloads
SDB is distributed from the Jena project SourceForge. SDB Download area
Documentation
SDB Installation Quickstart Command line utilities Store Description format Dataset And Model Descriptions Use from Java Specialized configuration Database Layouts FAQ Joseki Integration Databases supported
Subversion
SDB Subversion repository on SourceForge.
Support
Support and questions
Details
Database Notes
Databases supported PostgreSQL MySQL Oracle Microsoft SQL Server DB2 Derby HSQLDB H2
SDB/Installation
A suitable database must be installed separately. Any database installation should be tuned according to the database documentation. The SDB distribution is a zip file of a directory hierarchy. Unzip this. You may need to run chmod u+x on the scripts in the bin/ directory. Write an sdb.ttl store description: there are examples in the Store/ directory. A database must be created before the tests can be run. Microsoft SQL Server and PostgreSQL need specific database options set when a database is created. To use in a Java application, put all the jar files in lib/ on the build path and classpath of your application. See the Java API. To use the command line scripts, see the scripts page, including setting the environment variables SDBROOT, SDB_USER, SDB_PASSWORD and SDB_JDBC.
bin/sdbconfig --sdb=sdb.ttl --create
SDB/Quickstart
SDB provides some command line tools to work with SDB triple stores. In the following, it is assumed that you have a store description set up for your database (sdb.ttl). See the store description format for details, and the Store/ directory for some examples.
Be aware that this will wipe existing data from the database.
$ sdbconfig --sdb sdb.ttl --format
This creates a basic layout. It does not add all indexes to the triple table, which may be left until after loading.
Loading data
Load data with the sdbload command (see the commands page). You might want to add the --verbose flag to show the load as it progresses.
Adding indexes
You need to do this at some point if you want your queries to execute in a reasonable time.
$ sdbconfig --sdb sdb.ttl --index
Query
$ sdbquery --sdb sdb.ttl 'SELECT * WHERE { ?s a ?p }' $ sdbquery --sdb sdb.ttl --file query.rq
SDB/Commands
This page describes the command line programs that can be used to create an SDB store, load data into it and to issue queries.
Scripts
The directory bin/ contains shell scripts to run the commands from the command line. The scripts are bash scripts, which also run over Cygwin.
Script set up
Set the environment variable SDBROOT to the root of the SDB installation. A store description can include naming the class for the JDBC driver. Getting a Store object from a store description will automatically load the JDBC driver from the classpath. When running scripts, set the environment variable SDB_JDBC to one or more jar files for JDBC drivers. If it is more than one jar file, use the classpath syntax for your system. You can also use the system property jdbc.drivers. Set the environment variables SDB_USER and SDB_PASSWORD to the database user name and password for JDBC.
$ export SDBROOT="/path/to/sdb"
$ export SDB_USER="YourDbUserName"
$ export SDB_PASSWORD="YourDbPassword"
$ export SDB_JDBC="/path/to/driver.jar"
They are bash scripts, and work on Linux and Cygwin for MS Windows.
$ export PATH=$SDBROOT/bin:$PATH
Alternatively, there are wrapper scripts in $SDBROOT/bin2 which can be placed in a convenient directory that is already on the shell command path.
Argument Structure
All commands take an SDB store description to extract the connection and configuration information they need. This is written SPEC in the command descriptions below, but it can be composed of several arguments, as described here. Each command then has command-specific arguments, described below. All commands support --help to give details of named and positional arguments. There are two equivalent forms of named argument syntax:
--arg=val --arg val
Store Description
If this is not specified, commands load the description file sdb.ttl from the current directory.
--sdb=<sdb.ttl>
This store description is a Jena assembler file. The description consists of two parts: a store description and a connection description. Often, this is all that is needed to describe which store to use. The individual components of a connection or configuration can be overridden after the description has been read, before it is processed. The directory Store/ has example assembler files. The full details of the assembler file are given in 'SDB/Store Description'.
Modifying the Store Description
The individual items of a store description can be overridden by various command arguments. The description in the assembler file is read, then any command line arguments are used to modify the description, then the appropriate object is created from the modified description. Set the layout type:
--layout : layout name
The host name can be host or host:port. The better way to handle passwords is to use the environment variables SDB_USER and SDB_PASSWORD, because then the user/password is not stored in a visible way.
Logging and Monitoring
All commands take the following arguments (although they may do nothing if they make no sense for the command).
-v
Be verbose.
--time
Print timing information. Treat with care - while the timer avoids recording JVM and some class loading time, it can't avoid all class loading. Hence, the values of timing are more meaningful on longer operations. JDBC operation times to a remote server can also be a significant proportion in short operations.
--log=[all|none|queries|statements|exceptions]
Log SQL actions on the database connection (but not the prepared statements used by the loader). Can be repeated on the command line.
SDB Commands
Database creation
sdbconfig SPEC [--create|--format|--indexes|--dropIndexes]
Setup a database.
--create
Formats the store and sets up all indexes.
--format
Just formats the store, creating the indexes for loading, not querying.
--indexes
Create the indexes for querying.
--dropIndexes
Drop the indexes for querying.
Loading large graphs can be faster by formatting, loading the data, then building the query indexes with this command.
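For example, the format / load / index sequence might look like the following sketch (sdb.ttl and data.rdf are placeholder names):

```
sdbconfig --sdb sdb.ttl --format
sdbload   --sdb sdb.ttl data.rdf
sdbconfig --sdb sdb.ttl --index
```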
sdbtruncate SPEC
Truncate the store: all data is removed, but the formatted tables are kept.
Loading data
sdbload SPEC FILE [FILE ...]
Load RDF data into a store using the SDB bulk loader. Data is streamed into the database and is not loaded as a single transaction. The file's extension is used to determine the data syntax. To load into a named graph:
sdbload SPEC --graph=URI FILE [FILE ...]
Query
sdbquery SPEC --query=FILE
Execute a query.
sdbprint SPEC --print=X [--sql] --query=FILE
Print details of a query. X is any of query, op, sqlNode, sql or plan. --print=X can be repeated. --sql is short for --print=sql, which is also the default.
Testing
sdbtest SPEC MANIFEST
Execute a test manifest file. The manifest of all query tests, which will test connection and loading of data, is in SDBROOT/testing/manifest-sdb.ttl.
Other
sdbdump SPEC --out=SYNTAX
Dump the contents of a store in N-TRIPLES or a given serialization format (usual Jena syntax names, e.g. Turtle or TTL). This is only suitable for data sizes that fit in memory: all output syntaxes that do some form of pretty printing need additional space for their internal data structures.
sdbsql SPEC [ --file=FILE | SQL string ]
Execute a SQL command on the store, using the connection details from the store specification. The SQL command either comes from file FILE or the command line as a string.
sdbinfo SPEC
Details of a store.
sdbmeta SPEC --out=SYNTAX
Dump the metadata associated with a store.
sdbtuple SPEC
Many of the tables used within SDB are tuples of RDF nodes. This command allows low-level access to these tuple tables. Misuse of this command can corrupt the store.
SDB/Store Description
Use of an SDB store requires a Store object, which is described in two parts: a store description and a connection description.
These can be built from a Jena assembler description. Store objects themselves are lightweight so connections to an SDB database can be created on a per-request basis as required for use in J2EE application servers.
Store Descriptions
A store description identifies which storage layout is being used, the connection to use and the database type.
[] rdf:type sdb:Store ;
    sdb:layout "layout2" ;
    sdb:connection <#conn> .

<#conn> ...
SDB Connections
SDB connections, objects of the class SDBConnection, abstract away from the details of the connection and also provide consistent logging and transaction operations. Currently, SDB connections encapsulate JDBC connections, but other connection technologies, such as direct database APIs, can be added.
Example
The sdbType is needed for both a connection and for a store description. It can be given in either part of the complete store description. If it is specified in both places, it must be the same.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:   <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix sdb:  <http://jena.hpl.hp.com/2007/sdb#> .
<#myStore> rdf:type sdb:Store ;
    sdb:layout "layout2" ;
    sdb:connection <#conn> ;
    .

<#conn> rdf:type sdb:SDBConnection ;
    sdb:sdbType "derby" ;
    sdb:sdbName "DB/SDB2" ;
    sdb:driver "org.apache.derby.jdbc.EmbeddedDriver" ;
    .
Examples of assembler files are to be found in the Store/ directory in the distribution.
Vocabulary
Store
The value of sdbType needed for the connection also applies to choosing the store type.
layout
The name of the database layout, e.g. "layout2", "layout2/hash" or "layout2/index".
Connection
sdbType
The type of the database (e.g. "oracle", "MSSQLServerExpress", "postgresql", "mysql"). Controls both creating the JDBC URL, if not given explicitly, and the store type.
sdbHost
Host name for the database server. Include :port to change the port from the default for the database.
sdbName
The name of the database on the database server.
sdbUser sdbPassword
Database user name and password. The environment variables SDB_USER and SDB_PASSWORD are a better way to pass in the user and password because they are not then written into store description files. In Java programs, the system properties jena.db.user and jena.db.password can be used.
driver
The JDBC driver class name. Normally, the system looks up the sdbType to find the driver. Setting this property overrides that choice.
jdbcURL
If necessary, the JDBC URL can be set explicitly, not constructed by SDB. The sdbType is still needed.
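For example, a connection description that sets the JDBC URL explicitly might look like this sketch (the URL shown is illustrative):

```turtle
<#conn> rdf:type sdb:SDBConnection ;
    sdb:sdbType "postgresql" ;
    sdb:jdbcURL "jdbc:postgresql://localhost/SDB" .
```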
SDB/JavaAPI
This page describes how to use SDB from Java. Code examples are in src-examples/ in the SDB distribution.
Concepts
Store
SDB loads and queries data based on the unit of a Store. The Store object has all the information for formatting, loading and accessing an SDB database. One database or table space is one Store. Store objects are made via the static methods of the StoreFactory class.
SDBFactory
The SDBFactory class provides static methods for creating connections, stores, and the models, graphs and datasets backed by them.
SDBConnection
An SDBConnection wraps the connection to the database (currently, a JDBC connection) and provides consistent logging and transaction operations.
StoreDesc
A store description is the low-level mechanism for describing stores to be created.
DatasetStore GraphSDB
Two further classes are not immediately visible, because they are managed by the SDBFactory, which creates the necessary classes, such as Jena models and graphs. An object of class DatasetStore represents an RDF dataset backed by an SDB store; objects of this class trigger SPARQL queries being sent to SDB. The class GraphSDB provides the adapter between the standard Jena Java API and an SDB store, either to the default graph or one of the named graphs. The SDBFactory can also create Jena Models backed by such a graph.
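As a sketch of how these classes fit together (this assumes the Jena and SDB jars are on the classpath; the "sdb.ttl" file name and graph URI are illustrative):

```java
// Build a store from an assembler description, then wrap it in
// dataset/model views. Queries on the dataset are executed by SDB.
Store store = SDBFactory.connectStore("sdb.ttl") ;
Dataset ds  = SDBFactory.connectDataset(store) ;

// Jena models over the default graph and a named graph (GraphSDB underneath).
Model dftModel   = SDBFactory.connectDefaultModel(store) ;
Model namedModel = SDBFactory.connectNamedModel(store, "http://example/graph") ;
```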
The assembler file has two parts, the connection details and the store type.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja:   <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix sdb:  <http://jena.hpl.hp.com/2007/sdb#> .
_:c rdf:type sdb:SDBConnection ;
    sdb:sdbType "derby" ;
    sdb:sdbName "DB/SDB2" ;
    sdb:driver "org.apache.derby.jdbc.EmbeddedDriver" ;
    .

[] rdf:type sdb:Store ;
    sdb:layout "layout2" ;
    sdb:connection _:c ;
    .
See the full details of store description files for the options.
In Java code
The less flexible way to create a store description is to build it in Java. For example:
StoreDesc storeDesc = new StoreDesc(LayoutType.LayoutTripleNodesHash,
                                    DatabaseType.Derby) ;
JDBC.loadDriverDerby() ;
String jdbcURL = "jdbc:derby:DB/SDB2" ;
SDBConnection conn = new SDBConnection(jdbcURL, null, null) ;
Store store = SDBFactory.connectStore(conn, storeDesc) ;
Database User and Password
The user and password for the database can be set explicitly in the description file, but it is usually better to use an environment variable or Java system property, because this avoids writing the user name and password in a file.
Environment variable SDB_USER, or Java property jena.db.user.
Environment variable SDB_PASSWORD, or Java property jena.db.password.
Connection Management
Each store has a JDBC connection associated with it.
In situations where such connections are managed externally, the store object can be created and used within a single operation. A Store is lightweight and does not perform any database actions when created, so creating and releasing stores will not impact performance. Closing a store does not close the JDBC connection. Similarly, an SDBConnection is lightweight, and its creation does not result in any database or JDBC connection actions. The store description can be read from the same file, because any SDB connection information is ignored when reading just the store description. The store description can be kept across store creations:
storeDesc = StoreDesc.read("sdb.ttl") ;
then used with a JDBC connection object passed from the connection container:
public static void query(String queryString, StoreDesc storeDesc, Connection jdbcConnection)
{
    Query query = QueryFactory.create(queryString) ;
    SDBConnection conn = SDBFactory.createConnection(jdbcConnection) ;
    Store store = SDBFactory.connectStore(conn, storeDesc) ;
    Dataset ds = SDBFactory.connectDataset(store) ;
    QueryExecution qe = QueryExecutionFactory.create(query, ds) ;
    try {
        ResultSet rs = qe.execSelect() ;
        ResultSetFormatter.out(rs) ;
    } finally {
        qe.close() ;
    }
    store.close() ;
}
Formatting or Emptying the Store
Formatting is an expensive operation, and should be used sparingly. When you obtain a store for the first time, you will need to:
store.getTableFormatter().create();
This will create the necessary tables and indexes required for a full SDB store. You may empty the store completely using:
store.getTableFormatter().truncate();
Loading data
Data loading uses the standard Jena Model.read operations. GraphSDB, and models made from a GraphSDB, implement the standard Jena bulk data interface, backed by an SDB implementation of that interface.
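A minimal loading sketch ("data.rdf" is a placeholder file name, and store is assumed to have been obtained as described above):

```java
Model model = SDBFactory.connectDefaultModel(store) ;
model.read("file:data.rdf") ;   // streams through the SDB bulk loader
```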
Executing Queries
The interface for making queries with SDB is the same as that for querying with ARQ. SDB is an ARQ query engine that handles queries made on an RDF dataset which is of the SDB class DatasetStore:
Dataset ds = DatasetStore.create(store) ;
When finished, the store should be closed to release any resources associated with the particular implementation. Closing a store does not close its JDBC connection.
store.close() ;
If models or graphs backed by SDB are placed in a general Dataset then the query is not efficiently executed by SDB.
SDB is optimized for SPARQL queries but queries and other Jena API operations can be mixed. The results from a SPARQL query are Jena RDFNodes, with the associated model having a graph implemented by SDB.
SDB/Configuration
This page describes the configuration options available. These are options for query processing, not for the database layout and storage, which are controlled by store descriptions.
Setting Options
Options can be set globally, throughout the JVM, or on a per query execution basis. SDB uses the same mechanism as ARQ. There is a global context, which is given to each query execution as it is created. Modifications to the global context after the query execution is created are not seen by the query execution. Modifications to the context of a single query execution do not affect any other query execution, nor the global context. A context is a set of symbol/value pairs. Symbols are created internally to ARQ and SDB, and are accessed via Java constants. Values are any Java object, together with the values true and false, which are short for the constants of class java.lang.Boolean. Setting globally:
SDB.getContext().set(symbol, value) ;
Setting for a query execution happens before any query compilation or setup happens. Creation of a query execution object does not compile the query, which happens when the appropriate .exec method is called.
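A sketch of setting an option on one query execution only (the fetch size shown is arbitrary; query and ds are assumed to exist):

```java
QueryExecution qe = QueryExecutionFactory.create(query, ds) ;
qe.getContext().set(SDB.jdbcFetchSize, 1000) ;  // set before any .exec call
ResultSet rs = qe.execSelect() ;
```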
Current Options
SDB.unionDefaultGraph
Query patterns on the default graph match against the union of the named graphs. Default: false.
SDB.jdbcStream
Attempt to stream JDBC results. Default: true.
SDB.jdbcFetchSize
Set the JDBC fetch size on the SQL query statements. Must be >= 0. Default: unset.
SDB.streamGraphAPI
Stream the Jena graph API (also requires jdbcStream and jdbcFetchSize). Default: false.
SDB.annotateGeneratedSQL
Put SQL comments in the generated SQL. Default: true.
Queries over all Named Graphs (experimental feature)
All the named graphs can be treated as a single graph in two ways: either set the SDB option above, or use the URI that refers to the RDF merge of the named graphs (urn:x-arq:UnionGraph). When querying the RDF merge of the named graphs, the default graph in the store is not included. This feature applies to queries only; it does not affect the storage, nor does it change loading. To find out which named graph a triple can be found in, use GRAPH as usual.
The following special IRIs exist for use as a graph name in GRAPH only:
<urn:x-arq:DefaultGraph>: the default graph, even when the option for union queries is set.
<urn:x-arq:UnionGraph>: the union of all named graphs, even when the option for union queries is not set.
Streaming over JDBC
By default, SDB processes results from SQL statements in a streaming fashion. It is important to close query execution objects, especially if not consuming all the results, because that causes the underlying JDBC result set to be closed; JDBC result sets can be a scarce system resource. If this option is set but the JDBC connection is not streaming, the feature is harmless. Setting it false causes SDB to read all the results of an SQL statement at once, treating streamed connections as unstreamed. Note that results only stream end-to-end if the underlying JDBC connection itself is set up to stream; most are not in the default configuration, to reduce transient resource peaks on the server under load. Setting the fetch size enables cursors in some databases, but there may be restrictions imposed by the database; see the documentation for your database for details. In addition, operations on the graph API can be made streaming by also setting the graph API to streaming.
Annotated SQL
SQL generation can include SQL comments to show how SPARQL has been turned into SQL. This option is true by default, and is always set for the command sdbprint.
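For example, a query over the union of all the named graphs, regardless of the option setting:

```sparql
SELECT * WHERE { GRAPH <urn:x-arq:UnionGraph> { ?s ?p ?o } }
```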
SDB.getContext().setFalse(SDB.annotateGeneratedSQL) ;
SDB/Database Layouts
SDB does not have a single database layout. This page is an informal overview of the two main types ("layout2/hash" and "layout2/index"). In SDB, one store is one RDF dataset is one SQL database. Databases of type layout2 have a triples table for the default graph and a quads table for the named graphs. In the triples and quads tables, the columns are integers referencing a nodes table. In the hash form, the integers are 8-byte hashes of the node. In the index form, the integers are 4-byte sequence ids into the node table.
Triples
+-----------+ | S | P | O | +-----------+
Quads
+---------------+ | G | S | P | O | +---------------+
Primary key: GSPO. Indexes: GPO, GOS, SPO, OS, PO.
Nodes
In the index-based layout, the table is:
+------------------------------------------------+ | Id | Hash | lex | lang | datatype | value type | +------------------------------------------------+
Primary key: Hash.
All character fields are Unicode, supporting any character set, including mixed language use.
SDB/FAQ
Tune your database
Database performance depends on the database being tuned. Some databases default to a "developer setup", which does not use much of the RAM and is only suitable for functional testing.
Improving loading rates
For a large bulk load into an existing store, dropping the indexes, doing the load, and then recreating the indexes can be noticeably faster.
sdbconfig --drop
sdbload file
sdbconfig --index
For a large bulk load into a new store, just formatting it (not creating the indexes), doing the load, and then creating the indexes can be noticeably faster.
sdbconfig --format
sdbload --time file
sdbconfig --index
SDB/Joseki Integration
Joseki is a server that implements the SPARQL protocol for HTTP. It can be used to give a SPARQL interface to an SDB installation. The Joseki server needs the SDB jar files on its classpath. The Joseki configuration file needs to contain two triples to integrate SDB:
## Initialize SDB.
[] ja:loadClass "com.hp.hpl.jena.sdb.SDB" .

## Declare that sdb:DatasetStore is an implementation of ja:RDFDataset.
sdb:DatasetStore rdfs:subClassOf ja:RDFDataset .
To enable pooling of connections to the SDB store, use the joseki:poolSize property. This causes Joseki to create a pool of SDB datasets, each with its own JDBC connection. This requires Joseki 3.2.
<#sdb> rdf:type sdb:DatasetStore ;
    joseki:poolSize 5 ;    # Number of concurrent connections allowed to this dataset.
    sdb:store <#store> .
SDB 1.0:
## Dataset in SDB.
<#books> rdf:type sdb:DatasetStore , ja:RDFDataset ;
    rdfs:label "Books" ;
    sdb:layout "layout2" ;
    sdb:connection
        [ rdf:type sdb:SDBConnection ;
          sdb:sdbType "postgresql" ;
          sdb:sdbHost "localhost" ;
          sdb:sdbName "SDB" ;
        ] .
The database installation does not need to accept public requests, it needs only to be accessible to the Joseki server itself. There is an example configuration file for a Joseki server using SDB in the Joseki distribution.
SDB/Databases Supported
Supported Databases
Oracle 10g (including OracleXE)
Microsoft SQL Server 2005 (including MS SQL Express)
DB2 9 (including DB2 9 Express)
PostgreSQL v8.2
MySQL v5.0.22
Apache Derby v10.2.2.0
H2 1.0.71
HSQLDB 1.8.0
Support for a version implies support for later versions unless otherwise stated. Microsoft SQL Server 2000 is also reported to work. H2 support was contributed by Martin Hein (March 2008). IBM DB2 support was contributed by Venkat Krishnamurthy (October 2007). Please report earlier versions that also work.
SDB/NotesPostgreSQL
The database encoding should be UTF8. Set this when creating the database with pgAdmin or, if you use the command line, for example:
CREATE DATABASE "YourStoreName" WITH OWNER = "user" ENCODING = 'UTF8' TABLESPACE = pg_default;
Drop indexes
sdbconfig --drop
Load data
sdbload file
Add indexes
sdbconfig --index
Fresh store
PostgreSQL needs statistics to improve load performance, through the use of ANALYZE. When loading for the first time, there are no statistics, so for a large load it is advisable to load a sample, run ANALYZE, and then load the whole data.
sdbconfig --format
Load a sample of the data (a reasonable stopping point is when the load rate starts to drop appreciably). The sample must be representative of the data.
sdbload --time sample
Run ANALYZE on the database. If your sample is one part of a larger set of files to be loaded, this step is not necessary. If you are loading one single large file, then you might wish to empty the database after sampling; this is only needed if the data has bNodes in it.
Add the indexes. This only takes a few minutes even on a very large store but calculating them during loading (that is, --create, not --format) is noticeably slower.
sdbconfig --index
Tuning
It is essential to run the PostgreSQL ANALYZE command on the database, either during or after building. This is done via the command line psql, or via pgAdmin. The PostgreSQL documentation describes ways to run this as a background daemon. Various PostgreSQL configuration parameters will affect performance, particularly effective_cache_size. The parameter enable_seqscan may help avoid some unexpectedly slow queries.
SDB/NotesMySQL
National Characters
SDB formats all table columns used for storing text in the MySQL schema to UTF-8. However, this does not cause the data to be transmitted in UTF-8 over the JDBC connection. The best way is to run the server with a default character set of UTF-8. This is set in the MySQL server configuration file:
[mysqld]
default-character-set=utf8
A less reliable way is to pass parameters to the JDBC driver in the JDBC URL. The application will need to explicitly set the JDBC URL in the store configuration file.
...?useUnicode=true&characterEncoding=UTF-8
Connection timeouts
If you get the connection timing out after (by default) 8 hours of no activity, try setting autoReconnect=true in the JDBC URL.
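For example (an illustrative MySQL JDBC URL; the host and database name are placeholders):

```
jdbc:mysql://localhost/SDB?autoReconnect=true
```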
Tuning
1. For InnoDB, the critical parameter is innodb_buffer_pool_size. See the MySQL sample configuration files for details.
2. Running ANALYZE TABLE on the database tables can improve the choices made by the MySQL optimizer.
SDB/Support
Variations in the setups of databases make it hard to diagnose problems by email without precise and detailed information. Questions should be sent to: <mailto:jena-dev@groups.yahoo.com> Include a description of the database setup and the version numbers of all the software, including the database. A complete, minimal example should be included. Complete means sufficient information to exactly recreate the situation; minimal means no more data than is needed to demonstrate it. Large data files are not minimal; it is very rare that the problem cannot be illustrated with fewer than 10 triples. Queries should be as the parser sees them, not as Java strings. Use the command line tools to produce the complete, minimal example. If a query gives unexpected results, include some data; the query, stripped to the non-performing part; and a description of the expected results. You should compare the results by running ARQ's reference query engine first.
arq.sparql --engine=ref --data=datafile --query=queryfile
SDB/Loading data
There are three ways to load data into SDB:
1. Use the command utility sdbload
2. Use one of the Jena model.read operations
3. Use the Jena model.add operations
The last of these requires the application to signal the beginning and end of batches.
SDB will automatically bulk load data for each call of one of the Model.read operations.
Failing to notify the end of the operations will result in data loss. A try/finally block can ensure that the finish is notified.
model.notifyEvent(GraphEvents.startRead) ;
try {
    ... do add/remove operations ...
} finally {
    model.notifyEvent(GraphEvents.finishRead) ;
}
The model.read operations do this automatically. The bulk loader will automatically chunk large sequences of additions into sizes appropriate to the underlying database. The bulk loader is threaded and double-buffered: loading to the database happens in parallel with the application thread and any RDF parsing.
You should set these before the loader has been used. Each loader sets up two temporary tables (NNode and NTrip) that mirror the Nodes and Triples tables. These tables are virtually identical, except that (a) they are not indexed and (b) for the index variant there is no index column for nodes. When loading, prepared triples (triples that have been broken down ready for the database) are passed to the loader core (normally running on a different thread). When the chunk size is reached, or we are out of triples, the following happens:
1. Prepared nodes are added in one go to NNode. Duplicate nodes within a chunk are suppressed on the Java side (this is worth doing, since duplicates are quite common, e.g. properties).
2. Prepared triples are added in one go to NTrip.
3. New nodes are added to the node table (duplicate suppression is explained below).
4. New triples are added to the triple table (once again suppressing duplicates). For the index case, this involves joining on the node table to do a hash-to-index lookup.
5. We commit.
If anything goes wrong, the transaction (the chunk) is rolled back, and an exception is thrown (or readied for throwing on the calling thread).
Thus there are five calls to the database for every chunk. The database handles almost all of the work uninterrupted (duplicate suppression, hash to index lookup), which makes loading reasonably quick.
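A minimal, stdlib-only Java sketch of this per-chunk transaction follows; the Db interface and the call strings are illustrative stand-ins for the real database calls, not SDB's API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the per-chunk transaction: five batched calls committed together;
// any failure rolls the whole chunk back.
public class ChunkCommitSketch {
    interface Db { void run(String call) throws Exception; }

    static boolean loadChunk(Db db) {
        List<String> done = new ArrayList<>();
        try {
            for (String call : new String[] {
                    "INSERT prepared nodes INTO NNode",
                    "INSERT prepared triples INTO NTrip",
                    "INSERT new nodes INTO Nodes",      // anti-join suppresses dups
                    "INSERT new triples INTO Triples",  // anti-join suppresses dups
                    "COMMIT" }) {
                db.run(call);
                done.add(call);
            }
            return true;
        } catch (Exception e) {
            // Real loader rolls back and re-throws on the calling thread.
            System.out.println("ROLLBACK after " + done.size() + " calls");
            return false;
        }
    }

    public static void main(String[] args) {
        loadChunk(call -> System.out.println(call));                          // happy path
        loadChunk(call -> { throw new Exception("constraint violation"); });  // failure
    }
}
```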
Duplicate Suppression
MySQL has a very useful INSERT IGNORE, which will keep going, skipping an offending row if a uniqueness constraint is violated. For other databases we need something else. Having tried a number of options, the best seems to be to INSERT new items by LEFT JOINing the new items to the existing items, then filtering with a WHERE clause that keeps only the rows where the existing item is NULL, i.e. absent. Specifically, for the triple hash case (where no id lookups are needed):
INSERT INTO Triples
SELECT DISTINCT NTrip.s, NTrip.p, NTrip.o  -- DISTINCT because the new triples may contain duplicates (not so for nodes)
FROM NTrip LEFT JOIN Triples
  ON (NTrip.s = Triples.s AND NTrip.p = Triples.p AND NTrip.o = Triples.o)
WHERE Triples.s IS NULL OR Triples.p IS NULL OR Triples.o IS NULL
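As a plain-Java analogue of this anti-join (illustrative types only, not SDB code): de-duplicate the staged chunk, then insert only those triples that have no match in the existing table:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// In-memory analogue of the LEFT JOIN ... IS NULL duplicate suppression:
// a staged triple is kept only if no identical triple already exists.
public class AntiJoinSketch {
    record Triple(String s, String p, String o) {}

    public static void main(String[] args) {
        Set<Triple> triples = new LinkedHashSet<>();
        triples.add(new Triple("a", "type", "Student"));

        List<Triple> staged = List.of(
            new Triple("a", "type", "Student"),   // already stored: suppressed
            new Triple("a", "takes", "Course0"),  // new: inserted
            new Triple("a", "takes", "Course0")); // within-chunk dup: SELECT DISTINCT

        Set<Triple> distinctStaged = new LinkedHashSet<>(staged); // DISTINCT
        int inserted = 0;
        for (Triple t : distinctStaged) {
            if (!triples.contains(t)) {           // LEFT JOIN found no match
                triples.add(t);
                inserted++;
            }
        }
        System.out.println("inserted " + inserted + ", table size " + triples.size());
    }
}
```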
SDB/Loading performance
Contents
1 Introduction
2 The Databases and Hardware
  2.1 Hardware
  2.2 Windows setup
  2.3 Linux setup
3 The Dataset and Queries
  3.1 LUBM
  3.2 dbpedia
4 Loading
5 Results
6 Uniprot 700m loading: Tuning Helps
Introduction
Performance reporting is an area prone to misinterpretation, and such reports should be liberally decorated with disclaimers. In our case there are an alarming number of variables: the hardware, the operating system, the database engine and its myriad parameters, the data itself, the queries, and planetary alignment. Given all this, here is some basic information, which you may find sufficient:
* Loading speed will be in the thousands of triples per second range.
* Expect to load around 5 million triples per hour.
* Index layout is usually better than hash for loading speed. Hash loading is very bad on MySQL.
* Hash layout is better for query speed.
We suggest that you don't choose your database based on these figures. The performance is broadly similar, so if you already have a relational database installed this is your best option.
Hardware
Dual AMD Opteron processors, 64 bit, 1.8 GHz. 8 GB memory. 80 GB disk for database.
Windows setup
Linux setup
dbpedia
The dbpedia queries are, unlike LUBM, quite ground (they contain few variables). dbpedia contains many large literals, in contrast to LUBM.
Loading
All operations were performed using SDB's command line tools. The data was loaded into a freshly formatted SDB store (although PostgreSQL needs an ANALYZE afterwards to avoid poor query plans), then the additional indexes were added.
Results
Database                  | Loading speed (tps)
--------------------------|--------------------
LUBM Postgres (Hash)      | 4972
LUBM Postgres (Index)     | 8658
LUBM SQLServer (Hash)     | 8762
LUBM SQLServer (Index)    | 7419
DBpedia Postgres (Hash)   | 3029
DBpedia Postgres (Index)  | 4293
DBpedia SQLServer (Hash)  | 5345
DBpedia SQLServer (Index) | 4749

The original table also recorded Index time (s) and Size (MB); those figures survive only without their row alignment: 199, 176, 121, 68, 298, 5124, 3666, 3200, 2029, 10193.
The database was stored on a separate disk, and the database's transaction logs on yet another disk, so the RDF data, database data, and log data were all on distinct disks. Loading into an index-layout store proceeded at:
SDB/Query
SDB supports various layouts, but the overall process of compiling a query is common to all of them. Each layout provides the concrete, basic steps for mapping a query into the concrete database details.
The sdbquery command can be used to print out the SQL that would be used for a query.
SDB/Query performance
This page compares SDB with RDB, Jena's usual database layout. RDB was designed to support fine-grained API calls, as well as having some support for basic graph patterns, so the RDB design goals were not those of SDB. RDB uses a denormalised database layout so that statement-level operations do not require additional joins. The SDB layout is normalised so that the triple table is narrower and uses integers for RDF nodes; it then does joins to get the node representations back. This optimizes for longer patterns, not API operations. These figures were taken in July 2007. As with any performance figures, they should be taken merely as a guide. The shape of the data, the hardware details, the choice of database and its configuration (particularly the amount of memory used), as well as the queries themselves, all greatly contribute to execution costs.
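A minimal sketch of this normalised layout, using in-memory stand-ins rather than SDB's actual schema: nodes are interned to integer ids, the triple table stores only ids, and a join back to the node table recovers the node representations for results:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a normalised triple store: a node table assigning integer ids,
// a narrow triple table of ids, and a join back to node text for answers.
public class NormalisedLayoutSketch {
    static Map<String, Integer> nodeIds = new HashMap<>();
    static List<String> nodes = new ArrayList<>();   // id -> node representation
    static List<int[]> triples = new ArrayList<>();  // rows of {s, p, o} ids

    static int intern(String node) {
        return nodeIds.computeIfAbsent(node, n -> { nodes.add(n); return nodes.size() - 1; });
    }

    public static void main(String[] args) {
        triples.add(new int[] { intern("ex:alice"), intern("rdf:type"), intern("ub:GraduateStudent") });
        triples.add(new int[] { intern("ex:alice"), intern("ub:takesCourse"), intern("ex:Course0") });

        // Match the pattern (?x rdf:type ub:GraduateStudent) on the narrow id table...
        int p = intern("rdf:type"), o = intern("ub:GraduateStudent");
        for (int[] t : triples) {
            if (t[1] == p && t[2] == o) {
                // ...joining to the node table only to materialise the answer.
                System.out.println("?x = " + nodes.get(t[0]));
            }
        }
    }
}
```

The pattern matching itself touches only small integer rows; the wider node data is consulted once per result, which is the tradeoff that favours longer graph patterns over statement-level API calls.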
Setup
Database and hardware setup was the same as for the load performance tests. Data was generated with the LUBM test generator (with N = 15), then the inference closure was expanded on loading to give about 19.5 million triples. This data is larger than the database could completely cache. The queries are taken from the LUBM suite and rewritten in SPARQL.
LUBM Query 1
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT * WHERE {
  ?x rdf:type ub:GraduateStudent .
  ?x ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0> .
}
LUBM Query 2
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT * WHERE {
  ?x rdf:type ub:GraduateStudent .
  ?y rdf:type ub:University .
  ?z rdf:type ub:Department .
  ?x ub:memberOf ?z .
  ?z ub:subOrganizationOf ?y .
  ?x ub:undergraduateDegreeFrom ?y .
}
This query searches for a particular pattern in the data without a specific starting point.
Jena RDB: 232.1s (153s with an additional index on OP)
SDB/index: 12.7s
SDB/hash: 3.7s
Notes: removing the rdf:type statements actually slows the query down.
Summary
In SPARQL queries, there is often a sufficiently complex graph pattern that the SDB design tradeoff provides significant advantages in query performance.