Sponsored by Red Hat
Copyright 2014 R20/Consultancy. All rights reserved. Red Hat, Inc., Red Hat, Red Hat Enterprise Linux,
the Shadowman logo, and JBoss are trademarks of Red Hat, Inc., registered in the U.S. and other
countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.
Trademarks of companies referenced in this document are the sole property of their respective owners.
1 Management Summary
It didn't happen overnight, but a new era for business intelligence (BI) has arrived. Gone are the days when BI systems presented yesterday's data, when internal data was sufficient to cover all the organization's BI needs, and when development of new reports took a few weeks or even months. Today, organizations rely much more on BI systems than ever before. Having access to the right data, at the right time, and in the right form is crucial for decision-making processes in today's fast-moving world of business. BI has become a key instrument for organizations to distinguish themselves and to stay competitive.
In this new era, BI systems have to change, because they're confronted with new technological developments and new business requirements. These are some of the key challenges that BI systems are facing:
- Productivity improvement: Because the speed of business continues to increase, the productivity of BI developers has to improve as well. BI development must follow this speed of business. Taking a few weeks to develop a report is not acceptable anymore.
- Self-service BI: BI systems have to support self-service BI tools that allow users to develop and maintain their own reports.
- Operational intelligence: Users want to analyze operational data, not yesterday's data. This form of analytics is called operational intelligence or real-time analytics.
- Big data, Hadoop, and NoSQL: Undoubtedly, one of the biggest changes in the BI industry is initiated by big data. BI systems must embrace big data together with the accompanying Hadoop and NoSQL data storage technologies. The challenge is to allow users to use big data for reporting and analytics as easily as the data stored in classic systems.
- Systems in the cloud: Organizations are migrating BI system components to the much-discussed cloud. BI systems must be able to embrace cloud technology and cloud solutions in a transparent way.
- Data in the cloud: Analytical capabilities can be extended by enriching internal data with external data. On the internet, countless sources containing valuable external data are available, such as social media data and the numerous open data sources. The challenge for BI systems is to integrate all this valuable data in the cloud with internal enterprise data.
For many current BI systems it will be difficult to embrace all these challenges. The main reason is that their internal architectures are made up of a chain of databases, consisting of staging areas, data warehouses, and data marts. Data is made available to users by copying it from one database to another, and with each copy process the shape and form of the data gets closer to what the users require. This data supply chain has served many organizations well for many years, but it is now becoming an obstacle. It was designed with a "built to last" mentality; however, organizations are asking for solutions that are built with a "designed for change" approach.
This whitepaper describes a lean form of on-demand data integration technology called data virtualization. Deploying data virtualization results in BI systems with simpler and more agile architectures that can confront the new challenges much more easily. All the key concepts of data virtualization are described, including logical tables, importing data sources, data security, caching, and query optimization. Examples are given of application areas of data virtualization for BI, such as virtual data marts, big data analytics, extended data warehouse, and offloading cold data.
The whitepaper ends with an overview of the first open source data virtualization server: Red Hat's JBoss Data Virtualization.
Self-Service BI The approach used by many IT departments to develop BI reports is an iterative one. It usually starts with a user requesting a report. Next, a representative from the IT department begins with analyzing the user's needs. This involves interviews and the study of data structures, various reports, and documents. In most cases, it also involves a detailed analysis process by the IT specialist, primarily to understand what the user is requesting and what all the terms mean. This process of understanding can be very time-consuming. Eventually, the IT specialist comes up with an initial version of the report, which is shown to the user for review. If it's not what the user wants, the specialist starts to work on a second version and presents that to the user. This process may involve a number of iterations, depending on how good the user is at specifying his needs and how good the analyst is at understanding the user and his needs.
1 Aberdeen Group, Agile BI: Three Steps to Analytic Heaven, April 2011, see https://www.tableausoftware.com/sites/default/files/whitepapers/agile_bi.pdf
Big Data, Hadoop, and NoSQL Undoubtedly, one of the biggest trends in the BI industry is big data. Gartner2 predicts that big data will drive $232 billion in spending through 2016, and Wikibon3 claims that in 2017 big data revenue will have grown to $50.1 billion. Many organizations have embraced big data: some are still studying what big data could mean for them, many are in the process of developing big data systems, and some are already relying on these systems. And they all do it to enrich their analytical capabilities.
Whether big data is structured, unstructured, multi-structured, or semi-structured, it's always a massive amount of data. To handle such large databases, many organizations have decided not to deploy the familiar SQL systems but Hadoop systems or NoSQL systems, such as MongoDB and Cassandra. Hadoop and NoSQL systems are designed for big data workloads and are powerful and scalable, but they are different from the SQL systems. First of all, they don't always support the popular SQL database language or the familiar relational concepts, such as tables, columns, and records. Second, many of them support their own API, database language, and set of database concepts, so expertise in one product can't always easily be reused in another.
The challenge for BI will be to integrate all the big data stored in the Hadoop and NoSQL systems with the data warehouse environment, so that users can use big data for reporting and analytics as easily as they can small data.
Systems in the Cloud All the software components of a BI system, including the production systems, the
staging area, the data warehouse, and the data marts, used to run on-premises. Nowadays, components
may reside in the cloud.
Moving a component to the cloud can have an impact on its performance, its technical interface, its security aspects, and so on. Such changes can require redevelopment. The challenge is to adopt cloud solutions in a transparent way. For example, when a data mart is moved to the cloud, or when a cloud-based ERP system is introduced, all this should be as transparent as possible.
Figure 1 The classic BI architecture: a chain of databases (production databases, staging area, operational data store, data warehouse, and data marts) linked by ETL jobs, feeding analytics and reporting applications.
ETL jobs are commonly used to copy data from one database to another. They are responsible for transforming, integrating, and cleansing the data. ETL jobs are scheduled to run periodically. In other words, data integration and transformation is executed as a batch process.
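As an illustration, the sketch below shows what one periodically scheduled ETL step might look like in SQL. The table and column names (staging_raw_orders, dw_fact_orders, and so on) are hypothetical and only serve to make the idea concrete.

  -- Hypothetical nightly ETL step: copy yesterday's new orders from the staging
  -- area into the data warehouse, transforming and cleansing them on the way.
  INSERT INTO dw_fact_orders (order_id, customer_id, order_date, amount)
  SELECT s.order_id,
         s.customer_id,
         CAST(s.order_date AS DATE),
         CASE WHEN s.amount < 0 THEN 0 ELSE s.amount END   -- simple cleansing rule
  FROM   staging_raw_orders AS s
  WHERE  s.load_date = CURRENT_DATE - INTERVAL '1' DAY;    -- runs once per batch window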
The chain of databases and the ETL jobs that link them together form
a factory that transforms raw data (in the production databases) to
data for reporting purposes. This is very much like a real assembly line
in which raw products are processed, step by step, into end products.
This chain is a data supply chain.
Classic BI Systems and the New Challenges This data supply chain, with its many databases and ETL jobs, struggles to keep up with the new challenges described above.
Summary The classic BI architecture, consisting of a chain of databases and ETL jobs, has served us well for many years, but it's not evident how BI systems can face the new challenges by sticking to this architecture.
Data Virtualization in a Nutshell Data virtualization is a technology for integrating, transforming, and manipulating data coming from all kinds of data sources and presenting all that data as one unified view to all kinds of applications. Data virtualization provides an abstraction and encapsulation layer that, for applications, hides most of the technical aspects of how and where data is stored; see Figure 2. Because of that layer, applications don't need to know where all the data is physically stored, how the data should be integrated, where the database servers run, how to insert and update the data, what the required APIs are, which database language to use, and so on. When data virtualization is deployed, to every application it feels as if one large database is being accessed.
For completeness' sake, here is the definition of data virtualization4:
Data virtualization is a technology that offers data consumers a unified, abstracted, and
encapsulated view for querying and manipulating data stored in a heterogeneous set of data
stores.
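To make this concrete, here is a minimal sketch of what a query against a data virtualization server could look like. The tables customers and page_views are hypothetical; the point is that one may physically live in an on-premises SQL database and the other in a Hadoop or NoSQL store, yet the application simply joins them as if they were tables in one database.

  -- The application sees one logical database; the data virtualization server
  -- determines where each table really lives and how to retrieve its data.
  SELECT   c.customer_name,
           COUNT(*) AS page_views_last_week
  FROM     customers  AS c      -- could be a table in an on-premises SQL database
  JOIN     page_views AS p      -- could be data stored in Hadoop or a NoSQL store
           ON p.customer_id = c.customer_id
  WHERE    p.view_date >= CURRENT_DATE - INTERVAL '7' DAY
  GROUP BY c.customer_name;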
4 Rick F. van der Lans, Data Virtualization for Business Intelligence Systems, Morgan Kaufmann, 2012.
Figure 2 Data virtualization servers make a heterogeneous set of data sources (production databases, data warehouse and data marts, big data stores, streaming databases, external and social media data, and so on) look like one logical database to all kinds of applications (reporting and analytics tools, portals, mobile apps, websites, and dashboards).
Data Virtualization Offers On-Demand Integration When an application requests a data virtualization server to integrate data from multiple sources, the integration is executed on-demand. This is very different from ETL-based integration, where integration takes place before the application asks for it. In a typical ETL environment the retrieved data may have been integrated a week before. Not so with data virtualization, where integration is done live. Only when an application asks for data will the data virtualization server retrieve the required data from the source databases and integrate, transform, and cleanse it.
Compare this to buying sandwiches. When a customer orders a sandwich in a restaurant, all the ingredients, such as the bread, the ham, the cheese, and the lettuce, are integrated in the kitchen right there and then. That's data virtualization! ETL is comparable to buying a pre-packaged sandwich at a shop, where the integration of all the ingredients was done early in the morning or maybe even the evening before. Data virtualization is really on-demand data integration.
Transformations and Translations Because data virtualization technology offers access to many different data source technologies, many different APIs and languages are supported. It must be possible to handle requests specified in, for example, SQL, XQuery, XPath, REST, SOAP, and JMS. Technically, this means that when an application prefers to use a SOAP/XML interface to access data while the data source supports JMS, the data virtualization server must be able to translate SOAP/XML into JMS.
Lean Data Integration Integration of two data sources using ETL may require a lot of work. The integration logic has to be designed and implemented, a database to store the result of the integration process has to be set up, this database has to be tuned and optimized, it has to be managed during its entire operational life, the integration process must be scheduled, it has to be checked, and so on. With data virtualization, no such intermediate database has to be developed and managed, which makes the integration solution much leaner.
Not Limited to Read-Only When a derived database, such as a data mart, is created to hold the result of ETL jobs, the data in that database is read-only. Technically, its content can be changed, but it wouldn't make sense, because the application would be updating derived data, not the source itself.
With data virtualization the source is accessed directly, so when data is changed through a data virtualization server, it's the source data that's changed. With data virtualization, new data can be inserted, and existing data can be updated and deleted. Note that the source itself may not allow a data virtualization server to change its data.
The Logical Data Warehouse Users accessing data via a data virtualization server see one database consisting of many tables. The fact that they're accessing multiple data sources is completely transparent. They see the database they query as their data warehouse. But that data warehouse is no longer one physical database. It has become a logical concept and is therefore referred to as a logical data warehouse.
Importing Data Sources When a data source is imported, the data virtualization server retrieves metadata such as table structures and column names, and it may even gather some quantitative information on the tables, such as the number of records or the size in bytes. The result of an import is called a physical table; see Figure 3. A physical table can be seen as a wrapper on the data source.
Figure 3 Physical tables are used to wrap data sources.
Developing a physical table for a table in a SQL database involves a few simple clicks. When non-SQL sources are accessed, it may be a bit more difficult. For example, if data is retrieved from an Excel spreadsheet, it may be necessary to define the column names; when the source is a web service, a mandatory parameter may have to be specified; and when the source offers its data in a hierarchical structure, for example in XML or JSON format, the logic to flatten the data is required. But usually, the importing process is relatively straightforward.
The Logical Table Applications can use physical tables right after they have been defined. What they see is
of course the original data, the data as it is stored within the source system, including all the incorrect
data, misspelled data, and so on. In addition, the data has not been integrated with other sources yet. In
this case, the applications are responsible for integration; see Figure 3.
By defining logical tables (sometimes called virtual tables or logical data objects), transformation and integration specifications can be defined; see Figure 4. The same transformation and integration logic that normally ends up in ETL jobs ends up in the definitions of logical tables. To applications, logical tables look like ordinary tables; they have columns and records. The difference is that their contents are virtual. In that respect a logical table is like a view in a SQL database. Its virtual content is derived when the physical tables (the data sources) are accessed.
The definition of a logical table consists of a structure and content. The content is defined using a SQL query. Together, the structure and the query form the mapping of the logical table.
Figure 4 Logical tables are used to integrate, transform, and cleanse data.
The mapping defines how data from the physical tables has to be transformed and integrated. Developers have the full power of SQL at their disposal to define the virtual content. Every operation that can be specified in a SQL query can be used, including the following ones (the sketch after this list shows several of them combined in one mapping):
- Filters can be specified to select a subset of all the rows from the source table.
- Data from multiple physical tables can be joined together (integration).
- Columns in physical tables can be removed (projection).
- Values can be transformed by applying a long list of string manipulation functions.
- Columns in physical tables can be concatenated.
- Names of the columns in the source table, and the name of the table itself, can be changed.
- New virtual and derivable columns can be added.
- Group-by operations can be specified to aggregate data.
- Statistical functions can be applied.
- Incorrect data values can be cleansed.
- Rows can be sorted.
- Rank numbers can be assigned to rows.
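The following sketch shows how several of these operations can come together in the mapping of one logical table. All table and column names are hypothetical, and the exact DDL syntax differs per data virtualization product; the essence is that the mapping is a SQL query, comparable to a view definition.

  -- Hypothetical mapping of a logical table V_CustomerSales that integrates two
  -- physical tables wrapping two different source systems.
  CREATE VIEW V_CustomerSales AS
  SELECT   c.customer_id,
           UPPER(TRIM(c.customer_name)) AS customer_name,    -- transformation/cleansing
           c.country                    AS country_code,     -- renaming
           COUNT(*)                     AS number_of_orders, -- aggregation
           SUM(o.amount)                AS total_sales
  FROM     PT_CRM_Customers AS c                             -- physical table on source 1
  JOIN     PT_ERP_Orders    AS o                             -- physical table on source 2
           ON o.customer_id = c.customer_id                  -- integration (join)
  WHERE    o.order_status <> 'CANCELLED'                     -- filter
  GROUP BY c.customer_id, UPPER(TRIM(c.customer_name)), c.country;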
Nesting of Logical Tables Like views in a SQL database, logical tables can be nested (or stacked). In other words, logical tables can be defined on top of others; see Figure 5. A logical table defined this way is sometimes referred to as a nested logical table. Logical tables can be nested indefinitely.
The biggest benefit of being able to nest logical tables is that it allows common specifications to be shared. For example, in Figure 5, two nested logical tables, LT1 and LT2, are defined on a third, LT3. The advantage of this layered approach is that all the specifications inside the mapping of LT3 are shared by the other two. If LT3's common specifications are changed, they automatically apply to LT1 and LT2 as well. This can relate to cleansing, transformation, and integration specifications. So, when an integration solution for two data sources has been defined, all other logical tables and applications can reuse it.
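Continuing the hypothetical example above, nesting could look as follows: LT3 holds the shared integration logic, and LT1 and LT2 are defined on top of it. The names and logic are illustrative only.

  -- LT3: the shared integration and cleansing logic, defined once.
  CREATE VIEW LT3_CustomerOrders AS
  SELECT c.customer_id,
         UPPER(TRIM(c.customer_name)) AS customer_name,
         o.order_date,
         o.amount
  FROM   PT_CRM_Customers AS c
  JOIN   PT_ERP_Orders    AS o ON o.customer_id = c.customer_id;

  -- LT1 and LT2: nested logical tables that reuse LT3's specifications.
  CREATE VIEW LT1_SalesPerCustomer AS
  SELECT   customer_id, customer_name, SUM(amount) AS total_sales
  FROM     LT3_CustomerOrders
  GROUP BY customer_id, customer_name;

  CREATE VIEW LT2_RecentOrders AS
  SELECT customer_id, customer_name, order_date, amount
  FROM   LT3_CustomerOrders
  WHERE  order_date >= CURRENT_DATE - INTERVAL '30' DAY;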
Figure 5 Nested logical tables: LT1 and LT2 are defined on top of logical table LT3, which in turn is defined on the physical tables.
Publishing Logical Tables When logical tables have been defined, they need to be published. Publishing
means that the logical tables are made available for applications through one or more languages and APIs.
For example, one application wants to access a logical table using the language SQL and through the API
JDBC, whereas another prefers to access the same logical table as a web service using SOAP and HTTP.
Most data virtualization servers support a wide range of interfaces and languages. Here is a list of some of
the more popular ones:
- SQL with ODBC
- SQL with JDBC
- SQL with ADO.NET
- SOAP/XML via HTTP
- REST (Representational State Transfer) with JSON (JavaScript Object Notation)
- REST with XML
- REST with HTML
Note that when multiple technical interfaces are defined on one logical table, the mapping definitions are
reused. So, if a mapping is changed, all the applications, regardless of the technical interface they use, will
notice the difference.
Data Security Via Logical Tables Some source systems have their own data security system in place. They
support their own features to protect against incorrect or illegal use of the data. When a data
virtualization server accesses such sources, these security rules still apply, because the data virtualization
server is treated as a user of that data source.
But not all data sources have a data security layer in place. In that case, data security can be implemented
in the data virtualization server. For each table privileges can be granted in very much the same way as
access to tables in a SQL database is granted. Privileges such as select, insert, update or delete can be
granted. Note that data security can also be implemented when the data source supports its own data security mechanism.
Some data virtualization servers allow access privileges to be granted on table level, on individual column
level, on record level, and even on individual value level. The last situation means that two users can have
access to one particular record, where one user sees all the values and the other user sees a value being
masked.
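In SQL-like terms, granting such privileges on a logical table might look like the sketch below. The user groups and the table are hypothetical, and record-level and value-level security have no standard SQL syntax, so only table-level and column-level grants are shown.

  -- Table-level and column-level privileges on a logical table.
  GRANT SELECT ON V_CustomerSales TO reporting_users;
  GRANT SELECT (customer_id, country_code, total_sales) ON V_CustomerSales TO external_partners;
  GRANT SELECT, INSERT, UPDATE ON V_CustomerSales TO data_stewards;
  -- Record-level privileges and value masking are configured with
  -- product-specific features; there is no standard SQL syntax for them.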
Query Optimization When accessing data sources, performance is crucial. Therefore, it's important that data virtualization servers know how to access the sources as efficiently as possible; they must support an intelligent query optimizer.
One of the most important query optimization features is called pushdown. With pushdown, the data virtualization server tries to push as much of the query processing as possible down to the data sources themselves. So, when a query is received, the server analyzes it and determines whether it can push the entire query to the data source or whether only parts can be pushed down. In the former case, the result coming back from the source needs no extra processing by the data virtualization server and can be passed straight on to the application. In the latter case, the data virtualization server must do some extra processing before the result received from the source can be returned to the application.
Pushdown is applied to minimize the amount of data transmitted back to the data virtualization server, to let the database servers do as little I/O as possible, and to let the data virtualization server itself do as little processing as possible. All this improves query performance.
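A small example may clarify pushdown. Suppose an application sends the hypothetical query below to the data virtualization server, and the orders table is wrapped from a SQL source; the comments indicate which parts would typically be pushed down.

  -- Query as received from the application:
  SELECT   customer_id,
           SUM(amount) AS total_sales
  FROM     orders
  WHERE    order_date >= DATE '2014-01-01'   -- filter: pushed down, so the source
                                             -- returns only the matching rows
  GROUP BY customer_id;                      -- aggregation: pushed down when the source
                                             -- supports it, so only one row per customer
                                             -- travels back over the network
  -- If the source cannot process the GROUP BY, only the filter is pushed down and
  -- the data virtualization server performs the aggregation itself.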
Virtual Data Mart A data mart can be developed for many different reasons. One is to organize the tables
and columns in such a way that it becomes easy for reporting tools and users to understand and query the
data. So, the data marts are designed for a specific set of reports and users. In classic BI systems, data
marts are physical databases. The drawback of developing a physical data mart is, first, that the data mart
database has to be designed, developed, optimized, and managed, and second, that ETL processes have to
be designed, developed, optimized, managed, and scheduled to load the data mart.
With data virtualization, a data mart can be simulated using logical tables. The difference is that the tables the users see are logical tables; they are not physically stored. Their content is derived on-demand when the logical tables are queried. Hence the name virtual data mart. Users won't see the difference.
The big advantage of virtual data marts is agility: they can be developed and adapted more quickly than physical data marts.
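As a sketch, a virtual data mart is simply a set of logical tables whose names and structures match what the reports expect. The star-schema-style names below are hypothetical.

  -- Dimension table for the sales reports, derived from the customer source.
  CREATE VIEW DM_DimCustomer AS
  SELECT customer_id,
         UPPER(TRIM(customer_name)) AS customer_name,
         country                    AS country_code
  FROM   PT_CRM_Customers;

  -- Fact table for the sales reports, aggregated to the level the reports need.
  CREATE VIEW DM_FactSales AS
  SELECT   customer_id,
           CAST(order_date AS DATE) AS order_date,
           SUM(amount)              AS sales_amount
  FROM     PT_ERP_Orders
  GROUP BY customer_id, CAST(order_date AS DATE);
  -- No data mart database has to be designed, loaded, or refreshed; changing the
  -- virtual data mart is a matter of changing these view definitions.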
Extended Data Warehouse Not all data needed for analysis is available in the data warehouse. Non-traditional data sources, such as external data sources, call center log files, weblog files, voice transcripts from customer calls, and personal spreadsheets, are often not included. This is unfortunate, because including them can definitely enhance the analytical and reporting capabilities.
Enhancing a data warehouse with some of these data sources can be a laborious and time-consuming effort. For some data sources, it can take months before the data is incorporated in the chain of databases. In the meantime, business users can in no way get an integrated view of all that data, let alone invoke advanced forms of analysis on it.
With data virtualization servers, these sources together with the data warehouse will look like one integrated database. This concept is called an extended data warehouse. In a way, it feels as if data virtualization was designed for this purpose. In the literature, this concept is comparable to the logical data warehouse and the data delivery platform7.
Big Data Analytics More and more organizations have big data stored in Hadoop and NoSQL systems. Unfortunately, most reporting and analytical tools aren't able to access those database servers, because most of them require a SQL or comparable interface. There are two ways to solve this problem. First, relevant big data can be copied to a SQL database. However, in situations in which a Hadoop or NoSQL solution is selected, the amount of data is probably massive. Copying all that data can be time-consuming, and storing all that data twice can be costly.
The second solution is to use a data virtualization server on top of the Hadoop or NoSQL system, to wrap it as a physical table, and to publish it with a SQL interface. The responsibility of the data virtualization server is to translate the incoming SQL statements to the API or language of the big data system. Because the interfaces of the NoSQL systems are proprietary, data virtualization servers must support dedicated wrapper technology for each of them.
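From the perspective of a reporting tool, the result could look like the hypothetical query below: web_clicks is assumed to be a logical table wrapping data stored in Hadoop or in a NoSQL store, and the data virtualization server translates the SQL into the API or language of that system.

  -- The reporting tool sends plain SQL; the data virtualization server translates
  -- it into the query language or API of the big data store.
  SELECT   page_url,
           COUNT(*) AS clicks
  FROM     web_clicks                        -- logical table wrapping a Hadoop/NoSQL source
  WHERE    click_date >= DATE '2014-01-01'
  GROUP BY page_url
  ORDER BY clicks DESC;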
Operational Data Warehouse An operational data warehouse is normally described as a data warehouse that holds not only historical data, but also operational data. It allows users to run reports on data that was entered only a few seconds ago.
Implementing an operational data warehouse by copying new production data to the data warehouse quickly enough can be a technological challenge. By deploying a data virtualization server, an operational data warehouse can be simulated without the need to copy the data.
Data virtualization servers can be connected to all types of databases, including production databases. So, if reports need access to operational data, logical tables can be defined that point to tables in a production database. This allows an application to elegantly join operational data (in the production database) with historical data (in the data warehouse). It makes it possible to develop an operational data warehouse without having to actually build a data warehouse that stores operational data itself.
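A minimal sketch of such a report query is shown below; current_orders is assumed to be a logical table pointing directly at a production database, and historical_orders a logical table on the data warehouse.

  -- One result set combining operational data (seconds old) with historical data.
  SELECT customer_id, order_date, amount, 'operational' AS origin
  FROM   current_orders                     -- logical table on the production database
  WHERE  order_date = CURRENT_DATE
  UNION ALL
  SELECT customer_id, order_date, amount, 'historical' AS origin
  FROM   historical_orders                  -- logical table on the data warehouse
  WHERE  order_date < CURRENT_DATE;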
Normally, to minimize interference with production systems, ETL jobs are scheduled during so-called batch windows. However, many production systems now operate 24x7, which removes the batch window. The workload generated by data virtualization servers is more in line with this 24x7 constraint, because the query workload is spread out over the day. In addition, they support various features, such as caching and pushdown optimization, to access data sources as efficiently as possible.
Offloading Cold Data Data stored in a data warehouse can be classified as cold, warm, or hot. Hot data is used almost every day, cold data only occasionally. Keeping cold data in the data warehouse slows down the majority of the queries and is expensive, because all the data is stored in an expensive data storage system. If the data warehouse is implemented with a SQL database, it may be useful to store cold data outside that database, for example in a Hadoop system. This solution saves storage costs, but more importantly, it can speed up queries on the hot and warm data in the warehouse (less data), plus it can handle larger data volumes.
A data virtualization server is very useful when data from the SQL part of the data warehouse must be pumped to the Hadoop system: it makes copying the data straightforward, because it comes down to a simple copy of the contents of one SQL table to another. And, with a data virtualization server on top of the Hadoop big data files, reports can still access the cold data easily.
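In a sketch, offloading and accessing cold data could look as follows, assuming hot_orders is the table in the SQL-based data warehouse, cold_orders is a logical table wrapping files in Hadoop, and the five-year boundary is made up for the example. This also assumes the Hadoop source allows the data virtualization server to write to it.

  -- Offloading: copy cold rows from the SQL data warehouse to the Hadoop-backed table.
  INSERT INTO cold_orders
  SELECT * FROM hot_orders
  WHERE  order_date < CURRENT_DATE - INTERVAL '5' YEAR;

  DELETE FROM hot_orders
  WHERE  order_date < CURRENT_DATE - INTERVAL '5' YEAR;

  -- Reporting: a logical table that makes hot and cold data look like one table again.
  CREATE VIEW all_orders AS
  SELECT * FROM hot_orders
  UNION ALL
  SELECT * FROM cold_orders;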
Cloud Transparency As indicated in Section 2, components of a data warehouse architecture are being
moved to the cloud. Such a migration can lead to changes in how data is accessed. A data virtualization
server can hide this migration. By always making sure that all the data is accessed via a data virtualization
server, the latter can hide where the data is physically stored. If a database is moved to the cloud, moved
back to on-premises, or migrated from one cloud to another, the data virtualization server can hide the
technical differences, thus making such a migration painless and transparent for the reports and users. In
this way, a data virtualization server implements cloud transparency.
Summary Data virtualization technology can be used for a long list of application areas. This section
describes a few, but more can be identified, such as sandboxing for data scientists, data services, and
prototyping of integration solutions.
Not all users in an organization use the same reporting tool. Usually a wide range of tools is in use. Unfortunately, tools don't share integration, transformation, or cleansing specifications; see Figure 6. A solution developed for one tool cannot be re-used by another tool. (Note that in numerous BI systems, even if users are using the same tool, specifications are not shared either.) As a consequence, the same solution has to be implemented in many different tools, leading to a replication of all integration, transformation, and cleansing specifications.
Figure 6 BI tools do not share integration, transformation, and cleansing specifications; each tool has its own central repository with its own integration solution.
For example, a user can define the concept of a good customer based on the total number of orders he
has placed, the average value of orders, the number of products returned, the age of the orders, and so
on. He can enter filters and formulas to distinguish the good ones from the bad ones. If another user
needs a similar concept but uses another tool, he has to define it himself using his own tool. The
consequence is that the wheel is reinvented over and over again.
It should be clear that this lack of sharing lowers the agility of a BI system, reduces the productivity of BI developers, and lowers the correctness and consistency of report results.
Sharing specifications via a data virtualization server, by contrast, increases the agility of a BI system, raises the productivity of BI developers, and improves the correctness and consistency of report results; see Figure 7. In addition, if the integration of some data sources is complex, it can be implemented by IT specialists (using logical tables) for all BI users. So, no reinvention of the wheel, but sharing and reusing of specifications.
Figure 7 With data virtualization, all BI tools share the same integration, transformation, and cleansing specifications, stored in one central repository.
With respect to self-service BI tools, if users prefer to define integration specifications themselves, they can still do that by accessing the low-level logical tables. Instead of defining concepts, such as good customer, with their own tool, they can create re-usable specifications themselves in the data virtualization server. They will experience the same level of ease of use and flexibility they are used to with their self-service tools. Changing specifications in the data virtualization server is as easy as changing comparable specifications in a self-service BI tool.
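For instance, the good customer concept from the previous section could be defined once as a logical table, roughly as sketched below; the thresholds, column names, and underlying table are hypothetical. Every tool and every user can then reuse the same definition.

  -- One shared definition of the concept 'good customer', reusable by every tool.
  CREATE VIEW V_GoodCustomers AS
  SELECT customer_id, customer_name
  FROM   V_CustomerSales            -- hypothetical logical table defined earlier
  WHERE  number_of_orders >= 10     -- enough orders placed
    AND  total_sales      >= 5000;  -- with sufficient total value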
Collaborative and Rapid Development Each data virtualization product offers on-demand data viewing: when a logical table is created, users can study its (virtual) contents right away. JDV also supports on-demand data visualization through dashboarding. With this feature, the virtual contents of logical tables can be visualized as bar charts, pie charts, and so on, right after the table has been created; see Figure 8.
5 Teiid is a type of lizard. In addition, the name contains the acronym EII, which stands for Enterprise Information Integration. The term EII can be seen as a forerunner of data virtualization.
Figure 8 A screenshot of JBoss Data Virtualization showing data visualization through dashboards.
This on-demand data visualization feature allows for collaborative development. Analysts and business users can sit together and work on the definitions of logical tables. The analysts work on the definitions and the models (which may be too technical for some users), while the users see an intuitive visualization of the data: their data. Because of this collaborative development, less costly development time is lost on incorrect implementations. It leads to rapid development.
Design Environment The logical tables can be defined using a graphical and easy-to-use design environment. Figure 9 contains a screenshot showing five logical tables and their relationships. The familiar JBoss Developer Studio, available as an add-on, can be used as the design and development environment.
Lineage and Impact Analysis JDV stores all the definitions of concepts, such as data sources, logical tables,
and physical tables, in one central repository. This makes it easy for JDV to show all the dependencies
between these objects. This feature helps, for example, to determine what the effect will be if the
structure of a source table or logical table changes; which other logical tables must be changed as well? In
other words, JDV offers lineage and impact analysis.
Accessing Data Sources JDV can access a long list of source systems, including most well-known SQL database servers (Oracle, DB2, SQL Server, MySQL, and PostgreSQL), enterprise data warehouse platforms (Teradata, Netezza, and EMC/Greenplum), office tools (Excel, Access, and Google Spreadsheets), applications (SAP and Salesforce.com), flat files, XML files, SOAP and REST web services, and OData services.
Figure 9 A screenshot of JBoss Data Virtualization showing the relationships between tables.
Accessing Big Data: Hadoop and NoSQL For accessing big data stored in Hadoop or NoSQL systems, JDV comes with interfaces for HDFS and MongoDB. With respect to Hadoop, JDV's implementation works via Hive. So, JDV sends the SQL query coming from the application to Hive, and Hive translates that query into MapReduce code, which is then executed in parallel on HDFS files. In the case of MongoDB, hierarchical data is flattened to a relational table.
User-Defined Functions For logic too complex for SQL, developers can build their own functions. Examples
may be complex statistical functions or functions to turn a complex value into a set of simple values.
These user-defined functions can be developed in Java and can be invoked from SQL statements. Invoking
the UDFs is comparable to invoking built-in SQL functions.
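For example, once a hypothetical user-defined function ZIPCODE_TO_REGION has been developed in Java and registered, it could be used in the mapping of a logical table just like a built-in function; the function and table names are illustrative only.

  -- Hypothetical Java-based user-defined function used inside a logical table mapping.
  CREATE VIEW V_CustomersByRegion AS
  SELECT customer_id,
         customer_name,
         ZIPCODE_TO_REGION(zipcode) AS sales_region   -- UDF invoked like a built-in function
  FROM   PT_CRM_Customers;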
Two Languages For Developing Logical Tables Logical tables can be defined using SQL queries or stored
procedures. The SQL dialect supported is rich enough to specify most of the required transformations,
aggregations, and integrations. Stored procedures may be needed, for example, when non-SQL source
systems are accessed that require complex transformations to turn non-relational data into more
relational structures.
Query Optimizer To improve query performance, JDV's query optimizer supports various techniques to push down most or all of the query processing to the data sources. It supports several join processing strategies, such as merge joins and distributed joins. The processing strategy (or query plan) selected by the optimizer can be studied by the developers; the optimizer is not a black box. This openness of the optimizer is very useful for tuning and optimizing queries.
20
Caching JDV supports two forms of caches: internal materialized caches and external materialized caches. With internal materialized caches, the cache is kept in memory. The advantage of internal materialized caches is fast access to the data; the disadvantage is that memory is limited, so not all data can be cached. With external materialized caches, data is stored in a SQL database. In this case, there is no restriction on the size of the cached data, but access will be somewhat slower than the in-memory alternative.
Publishing Logical Tables JDV supports a long list of APIs through which logical tables can be accessed, including JDBC, ODBC, SOAP, REST, and OData.
Data Security JDV supports the four different forms of data access security described in Section 5. Privileges such as select, insert, update, and delete can be granted at the table level, individual column level, record level, and even at the individual value level. This makes it possible to let JDV operate as a data security firewall. In addition, when logical tables are published using particular APIs, security aspects can be defined as well. For example, developers can publish a logical table using SOAP extended with WS-Security.
Embeddable All the functionality of JDV can be invoked through an open API, which makes JDV an embeddable data virtualization server. Vendors and organizations can use this API to develop solutions with JDV embedded inside.
Summary JBoss Data Virtualization is a mature data virtualization server that allows organizations to develop BI systems with more agile architectures. Its on-demand data integration capabilities make it ready for many application areas:
- Virtual data marts
- Extended data warehouse
- Big data analytics
- Operational data warehouse
- Offloading cold data
- Cloud transparency
JDV allows the development of agile BI systems that are ready for the new challenges they are faced with.
6 See http://www.b-eye-network.com/channels/5087/articles/
7 See http://www.b-eye-network.com/channels/5087/view/12495
8 R.F. van der Lans, Data Virtualization for Business Intelligence Systems, Morgan Kaufmann Publishers, 2012.
9 R.F. van der Lans, Introduction to SQL: Mastering the Relational Database Language, fourth edition, Addison-Wesley, 2007.