
Re-think Data Integration: Delivering Agile BI Systems with Data Virtualization

A Technical Whitepaper

Rick F. van der Lans


Independent Business Intelligence Analyst
R20/Consultancy
March 2014

Sponsored by Red Hat, Inc.

Copyright 2014 R20/Consultancy. All rights reserved. Red Hat, Inc., Red Hat, Red Hat Enterprise Linux,
the Shadowman logo, and JBoss are trademarks of Red Hat, Inc., registered in the U.S. and other
countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries.
Trademarks of companies referenced in this document are the sole property of their respective owners.

Table of Contents

1  Management Summary
2  The New Challenges for BI Systems
3  Current BI Systems and the New Challenges
4  On-Demand Integration with Data Virtualization
5  Under the Hood of Data Virtualization Servers
6  BI Application Areas for Data Virtualization
7  Data Virtualization and the New BI Challenges
8  Data Virtualization Simplifies Sharing of Integration Specifications
9  Overview of Red Hat's JBoss Data Virtualization Server
About the Author Rick F. van der Lans
About Red Hat, Inc.


1 Management Summary
It didn't happen overnight, but a new era for business intelligence (BI) has arrived. Gone are the days when BI systems presented yesterday's data, when internal data was sufficient to cover all the organization's BI needs, and when development of new reports took a few weeks or even months.

Today, organizations rely much more on BI systems than ever before. Having access to the right data, at the right time, and in the right form is crucial for decision making processes in today's fast-moving world of business. BI has become a key instrument for organizations to distinguish themselves and to stay competitive.


In this new era, BI systems have to change, because they're confronted with new technological developments and new business requirements. These are some of the key challenges that BI systems are facing:
Productivity improvement: Because the speed of business continues to increase, the productivity
of BI developers has to improve as well. BI development must follow this speed of business.
Taking a few weeks to develop a report is not acceptable anymore.
Self-service BI: BI systems have to support self-service BI tools that allow users to develop and
maintain their own reports.
Operational intelligence: Users want to analyze operational data, not yesterday's data. This form of analytics is called operational intelligence or real-time analytics.
Big data, Hadoop, and NoSQL: Undoubtedly, one of the biggest changes in the BI industry is initiated by big data. BI systems must embrace big data together with the accompanying Hadoop and NoSQL data storage technologies. The challenge is to allow users to use big data for reporting and analytics as easily as the data stored in classic systems.
Systems in the cloud: Organizations are migrating BI system components to the much-discussed
cloud. BI systems must be able to embrace cloud technology and cloud solutions in a transparent
way.
Data in the cloud: Analytical capabilities can be extended by enriching internal data with external
data. On the internet, countless sources containing valuable external data are available, such as
social media data and the numerous open data sources. The challenge for BI systems is to
integrate all this valuable data in the cloud with internal enterprise data.
For many current BI systems it will be difficult to embrace all these challenges. The main reason is that their internal architectures are made up of a chain of databases, consisting of staging areas, data warehouses, and data marts. Data is made available to users by copying it from one database to another, and with each copy process the shape and form of the data gets closer to what the users require. This data supply chain has served many organizations well for many years, but is now becoming an obstacle. It was designed with a "built to last" mentality; however, organizations are asking for solutions that are built with a "designed for change" approach.
This whitepaper describes a lean form of on-demand data integration technology called data virtualization. Deploying data virtualization results in BI systems with simpler and more agile architectures that can confront the new challenges much more easily. All the key concepts of data virtualization are described, including logical tables, importing data sources, data security, caching, and query optimization. Examples are given of application areas of data virtualization for BI, such as virtual data marts, big data analytics, the extended data warehouse, and offloading cold data.

The whitepaper ends with an overview of the first open source data virtualization server: Red Hat's JBoss Data Virtualization.

2 The New Challenges for BI Systems


BI systems are faced with new technological developments and new business requirements. The
consequence is that BI systems have to change. This section lists some of the key challenges that BI
systems face today.

Productivity Improvement  If IT wants to assist organizations to stay competitive and cost-effective, development of BI systems must follow the ever increasing speed of business. This was clearly shown in a study by the Aberdeen Group [1]: 43% of enterprises find that making timely decisions is becoming more difficult. Managers increasingly find they have less time to make decisions after certain business events occur. The consequence is that it must be possible to modify existing reports faster and to develop new ones more quickly.

Unfortunately, whether it's due to the quality of the tools, the developers themselves, the continuously changing needs of users, or the inflexible architecture of most BI systems, many IT departments struggle with their BI productivity. BI backlogs are increasing.

[1] Aberdeen Group, Agile BI: Three Steps to Analytic Heaven, April 2011; see https://www.tableausoftware.com/sites/default/files/whitepapers/agile_bi.pdf

Self-Service BI  The approach used by many IT departments to develop BI reports is an iterative one. It usually starts with a user requesting a report. Next, a representative from the IT department begins with analyzing the user's needs. This involves interviews and the study of data structures, various reports, and documents. In most cases, it also involves a detailed analysis process by the IT specialist, primarily to understand what the user is requesting and what all the terms mean. This process of understanding can be very time-consuming. Eventually, the IT specialist comes up with an initial version of the report, which is shown to the user for review. If it's not what the user wants, the specialist starts to work on a second version and presents that to the user. This process may involve a number of iterations, depending on how good the user is in specifying his needs and how good the analyst is in understanding the user and his needs. Finally, all this work leads to an implementation.


It's obvious that this iterative process can be time-consuming. Understandably, many users have started to look for an alternative solution, and they found self-service BI tools. These tools, with their intuitive interfaces, have been designed for users to develop their own reports. Users already understand their own needs and they know what they want. This means that many of the steps described above can be skipped, and that improves productivity dramatically.

But self-development can lead to chaos. Users are not professional developers; they have not been trained in developing re-usable solutions or in formal testing techniques, and they don't aim for developing shared metadata and integration specifications. Their only goal is to develop their report as quickly as possible. The challenge for BI systems is how to manage this self-service development in such a way that the reports return correct and consistent results and that the wheel is not reinvented over and over again.

Operational Intelligence  There was a time when users of data warehouses were satisfied with reports containing one-week-old data. Today, users don't accept such a data latency anymore; they want a data latency of one minute, or maybe even a few seconds or less. Especially user groups such as operational management and external parties want to have insight into the most current situation; yesterday's data is worthless to them. This form of BI, in which users need a very low data latency, is called operational intelligence (sometimes called real-time analytics).

If new data is entered in production systems, and the reporting is done on data marts, the key technical challenge is to copy data from the production systems very rapidly, via the staging area and data warehouse, to the data marts. It must be clear to everyone that the longer the chain, the higher the data latency. And in many BI systems the chain is long.

Big Data, Hadoop, and NoSQL  Undoubtedly, one of the biggest trends in the BI industry is big data. Gartner [2] predicts that big data will drive $232 billion in spending through 2016, and Wikibon [3] claims that in 2017 big data revenue will have grown to $50.1 billion. Many organizations have adopted big data. There are those who are studying what big data could mean for them, many are in the process of developing big data systems, and some are already relying on these systems. And they all do it to enrich their analytical capabilities.

Whether big data is structured, unstructured, multi-structured, or semi-structured, it's always a massive amount of data. To handle such large databases, many organizations have decided not to deploy familiar SQL systems but Hadoop systems or NoSQL systems, such as MongoDB and Cassandra. Hadoop and NoSQL systems are designed for big data workloads and are powerful and scalable, but they are different from the SQL systems. First of all, they don't always support the popular SQL database language nor the familiar relational concepts, such as tables, columns, and records. Second, many of them support their own API, database language, and set of database concepts. So, expertise in one product can't always easily be reused in another.

The challenge for BI will be to integrate all the big data stored in the Hadoop and NoSQL systems with the data warehouse environment, so that users can use big data for reporting and analytics as easily as they can small data.

[2] Gartner, October 2012; see http://techcrunch.com/2012/10/17/big-data-to-drive-232-billion-in-it-spending-through-2016/
[3] Wikibon, Big Data Vendor Revenue and Market Forecast 2013-2017, February 12, 2014; see http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2013-2017

Systems in the Cloud All the software components of a BI system, including the production systems, the
staging area, the data warehouse, and the data marts, used to run on-premises. Nowadays, components
may reside in the cloud.
Moving a component to the cloud can have an impact on its performance, the technical interface, security
aspects, and so on. Such changes can require redevelopment. The challenge is to adopt cloud solutions in
a transparent way. For example, when a data mart is moved to the cloud, or when a cloud-based ERP
system is introduced, all this should be as transparent as possible.

Data in the Cloud  The data sources of a data warehouse used to be limited to internal production systems. Reporting and analytics on internal, enterprise data can lead to useful business insights, but the cloud contains massive amounts of valuable external data that enriches analytical capabilities and leads to more unexpected insights. For example, by integrating internal customer data with social media data, a more detailed and complete picture can be developed of what a customer thinks about the products and the company.

Nowadays, loads of external data are available in the cloud, of which social media data is the most well-known. But it's not only social media data. Thousands and thousands of open data sources have become available to the public. Examples of open data sources are weather data, demographic data, energy consumption data, hospital performance data, public transport data, and the list goes on. Almost all these open data sources are available in the cloud through some API.

The challenge for BI systems is to integrate all this valuable cloud-based data with internal enterprise data. Copying all this data may be too expensive, so smarter solutions must be developed.

3 Current BI Systems and the New Challenges


The challenges described in the previous section may be hard to meet in existing BI systems, due to
their architecture. This section describes the classic BI architecture and summarizes why the challenges
described in the previous section cannot be easily met.

Classic BI Systems  The architectures of most BI systems resemble a long chain of databases; see Figure 1. In such systems, data is entered using production applications and stored in production databases. Data is then copied via a staging area and an operational data store to a central data warehouse. Next, it's copied to data marts in which the majority of all the reporting and analytics takes place.


Figure 1  Most BI systems consist of a chain of databases in which new data is entered in production systems and from there copied from one database to another: from the production databases via the staging area, the operational data store, and the data warehouse to the data marts, with ETL jobs in between, and with analytics and reporting done at the end of the chain.
ETL jobs are commonly used to copy data from one database to another. They are responsible for transforming, integrating, and cleansing the data. ETL jobs are scheduled to run periodically. In other words, data integration and transformation is executed as a batch process.

The chain of databases and the ETL jobs that link them together form a factory that transforms raw data (in the production databases) into data for reporting purposes. This is very much like a real assembly line in which raw products are processed, step by step, into end products. This chain is a data supply chain.

Classic BI Systems and the New Challenges  This data supply chain, with its batch-oriented style of copying, has served numerous organizations well for many years, but is now becoming an obstacle. It was designed with a "built to last" mentality. The consequence is that apparently simple report changes can lead to an eruption of changes to databases and ETL jobs, consuming a lot of development time. There are many steps where things can go wrong. Due to its ever growing nature, the chain has been stretched to the limit and has become brittle. Today, organizations demand solutions that are built with a "designed for change" approach.
But most worrisome is that it's difficult for these systems to meet the new challenges:

Productivity Improvement: A key characteristic of ETL is that integration results can only be used when they have been stored in a database. Such a database has to be installed, designed, managed, kept up to date, and so on. All this costs manpower.

Self-Service BI: Self-service BI or not, reports should return correct and consistent results and productivity should be high; otherwise, nothing is gained. This requires that specifications entered by self-service BI users must be shareable. The need to reinvent the wheel should be minimal. Unfortunately, current BI systems don't have a module that makes it easy for users to share specifications.
Operational Intelligence: As indicated in the previous section, new data entered in production systems is copied several times before it becomes available for reporting. This is far from ideal for users interested in analyzing zero-latency data. Somehow, the chain must be shortened; there should be fewer databases and fewer ETL jobs.

Big Data, Hadoop, and NoSQL: Big data is sometimes too big to copy. A copying process may take too long, or the storage of duplicated big data can be too expensive. With respect to data transformation, data integration, and data cleansing, big data must be processed differently. It should not be pushed through the chain.
Systems in the Cloud: When systems are moved to the cloud, it may be necessary to change the
way in which data is extracted. For example, copying data can take longer when running in the
cloud. It may be required to encrypt the data when it's transmitted over the Internet, which may
not have been relevant before. The architecture of BI systems should be flexible enough so that
moving components into, out of, or within the cloud is transparent.
Data in the Cloud: In principle, external data in the cloud can be processed in the same way as
internal data: it can be extracted, integrated, transformed, and then copied through the chain of
databases. However, it may be more convenient to run reports directly on these external data
sources. Such a solution would be hard to fit in the existing architecture. For example, when a
report needs data from a data mart and an external data source, how and where is that data
integrated, transformed, and cleansed?

Summary  The classic BI architecture, consisting of a chain of databases and ETL jobs, has served us well for many years, but it's not evident how BI systems, by sticking to this architecture, can face the new challenges.

4 On-Demand Integration with Data Virtualization


This section describes a newer technology for data integration called data virtualization and how it offers
a lean form of on-demand data integration. The following sections describe respectively how these
products work, their application areas, and how they meet the requirements listed in Section 2.

Data Virtualization in a Nutshell  Data virtualization is a technology for integrating, transforming, and manipulating data coming from all kinds of data sources and presenting all that data as one unified view to all kinds of applications. Data virtualization provides an abstraction and encapsulation layer that, for applications, hides most of the technical aspects of how and where data is stored; see Figure 2. Because of that layer, applications don't need to know where all the data is physically stored, how the data should be integrated, where the database servers run, how to insert and update the data, what the required APIs are, which database language to use, and so on. When data virtualization is deployed, it feels to every application as if it is accessing one large database.
For completeness' sake, here is the definition of data virtualization [4]:

Data virtualization is a technology that offers data consumers a unified, abstracted, and encapsulated view for querying and manipulating data stored in a heterogeneous set of data stores.

[4] Rick F. van der Lans, Data Virtualization for Business Intelligence Systems, Morgan Kaufmann, 2012.


Figure 2  Data virtualization servers make a heterogeneous set of data sources look like one logical database to the applications. The sources include production databases and applications, the data warehouse and data marts, unstructured data, streaming databases, big data stores, an ESB, private data, social media data, and other external data; the consumers include production applications, analytics and reporting tools, internal portals, mobile apps, websites, and dashboards.

Data Virtualization Offers On-Demand Integration  When an application requests a data virtualization server to integrate data from multiple sources, the integration is executed on-demand. This is very different from ETL-based integration, where integration takes place before the application asks for it. In a typical ETL environment the retrieved data may have been integrated a week before. Not so with data virtualization, where integration is done live. When an application asks for data, only then will the data virtualization server retrieve the required data from the source databases, and integrate, transform, and cleanse it.
Compare this to buying sandwiches. When a customer orders a sandwich in a restaurant, all the ingredients, such as the bread, the ham, the cheese, and the lettuce, are integrated in the kitchen right there and then. That's data virtualization! ETL compares with buying a pre-packaged sandwich at a shop, where the integration of all the ingredients was done early in the morning or maybe even the evening before. Data virtualization is really on-demand data integration.

Transformations and Translations Because data virtualization technology offers access to many different
data source technologies, many different APIs and languages are supported. It must be possible to handle
requests specified in, for example, SQL, XQuery, XPath, REST, SOAP, and JMS. Technically, what this means
is that when an application prefers to use a SOAP/XML interface to access data while the data source
supports JMS, the data virtualization server must be able to translate SOAP/XML into JMS.

Lean Data Integration Integration of two data sources using ETL may require a lot of work. The integration
logic has to be designed and implemented, a database to store the result of the integration process has to
be set up, this database has to be tuned and optimized, it has to be managed during its entire operational
life, the integration process must be scheduled, it has to be checked, and so on.


The key advantage of on-demand integration via data virtualization is lean data integration. With data virtualization only the integration logic has to be designed and implemented. When this is done, applications can access the integrated data result. There is no need to develop and manage extra databases to hold integration results. The benefits of this lean form of integration are speed of delivery and the ease with which an existing integration solution can be changed.

Not Limited to Read-Only  When a derived database, such as a data mart, is created to hold the result of ETL jobs, the data in that database is read-only. Technically, its content can be changed, but it wouldn't make sense, because the application would be updating derived data, not the source itself.

With data virtualization the source is accessed directly, so when data is changed using a data virtualization server, it's the source data that's changed. With data virtualization, new data can be inserted, and existing data can be updated and deleted. Note that the source itself may not allow a data virtualization server to change data.

The Logical Data Warehouse  Users accessing data via a data virtualization server see one database consisting of many tables. The fact that they're accessing multiple data sources is completely transparent. They will regard the database they query as their data warehouse. But that data warehouse is not one physical database anymore. It has become a logical concept and is therefore referred to as a logical data warehouse.

Data Virtualization Does Not Replace ETL  Sometimes data virtualization is unjustly considered to be a threat to ETL. Data virtualization does not replace ETL. Admittedly, some ETL work will be replaced by on-demand integration, but not in all places. For example, in many (and probably most) organizations a physical data warehouse will still be necessary, for example to keep track of historical data, or because the production systems can't be accessed due to potential performance or stability problems. If that physical data warehouse is still needed, ETL can be the right integration technology to load data periodically.

Data virtualization and ETL are complementary integration solutions, each with its own strengths and weaknesses. It's up to the architect to determine what the best solution is for a specific integration problem.

5 Under the Hood of Data Virtualization Servers


Most data virtualization servers support comparable concepts. This section describes these key concepts.

Importing Data Sources  Before data in source systems can be accessed via a data virtualization server, their specifications must be imported. This doesn't mean that the data is loaded in the data virtualization server, but that a full description of, for example, a physical SQL table is stored by the data virtualization server in its own repository. This description includes the structure (columns) of the physical table, the data types and lengths of each column, the physical location of the table, security details, and so on. Some even gather quantitative information on the tables, such as the number of records or the size in bytes. The result of an import is called a physical table; see Figure 3. A physical table can be seen as a wrapper on the data source.
Figure 3  Physical tables are used to wrap data sources.

Developing a physical table for a table in a SQL database involves a few simple clicks. When non-SQL
sources are accessed, it may be a bit more difficult. For example, if data is retrieved from an Excel
spreadsheet, it may be necessary to define the column names; when the source is a web service, a
mandatory parameter may have to be specified; and, when the source offers its data in a hierarchical
structure, for example in XML or JSON format, the logic to flatten the data is required. But usually, the
importing process is relatively straightforward.

The Logical Table Applications can use physical tables right after they have been defined. What they see is
of course the original data, the data as it is stored within the source system, including all the incorrect
data, misspelled data, and so on. In addition, the data has not been integrated with other sources yet. In
this case, the applications are responsible for integration; see Figure 3.
By defining logical tables (sometimes called virtual tables or logical data objects), transformation and
integration specifications can be defined; see Figure 4. The same transformation and integration logic that
normally ends up in ETL jobs, ends up in the definitions of logical tables. To applications, logical tables look
like ordinary tables; they have columns and records. The difference is that the content is virtual. In that
respect a logical table is like a view in a SQL database. Its virtual content is derived when the physical
tables (the data sources) are accessed.
The definition of a logical table consists of a structure and a content. The content is defined using a SQL
query. Together, the structure and the query form the mapping of the logical table.


Figure 4  Logical tables are used to integrate, transform, and cleanse data.

The mapping defines how data from the physical tables has to be transformed and integrated. Developers have the full power of SQL at their disposal to define the virtual content; a sketch of such a mapping is shown after this list. Each operation that can be specified in a SQL query can be used, including the following ones:

Filters can be specified to select a subset of all the rows from the source table.
Data from multiple physical tables can be joined together (integration).
Columns in physical tables can be removed (projection).
Values can be transformed by applying a long list of string manipulation functions.
Columns in physical tables can be concatenated.
Names of the columns in the source table, and the name of the table itself, can be changed.
New virtual and derivable columns can be added.
Group-by operations can be specified to aggregate data.
Statistical functions can be applied.
Incorrect data values can be cleansed.
Rows can be sorted.
Rank numbers can be assigned to rows.
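As an illustration, the following sketch shows what such a mapping could look like when expressed as a SQL view definition. The table and column names (CUSTOMERS, ORDERS, and so on) are hypothetical, and the exact DDL syntax differs per data virtualization product; the point is simply that the mapping is an ordinary SQL query over physical tables.

    CREATE VIEW V_CUSTOMER_SALES AS
    SELECT C.CUSTOMER_ID,
           UPPER(TRIM(C.CUSTOMER_NAME)) AS CUSTOMER_NAME,          -- transformation and cleansing
           C.CITY,
           SUM(O.ORDER_AMOUNT) AS TOTAL_ORDER_AMOUNT,              -- aggregation
           COUNT(*) AS NUMBER_OF_ORDERS
    FROM   CUSTOMERS AS C                                          -- physical table wrapping a source table
           JOIN ORDERS AS O                                        -- physical table wrapping another source table
             ON O.CUSTOMER_ID = C.CUSTOMER_ID                      -- integration (join)
    WHERE  O.ORDER_STATUS <> 'CANCELLED'                           -- filter
    GROUP BY C.CUSTOMER_ID, UPPER(TRIM(C.CUSTOMER_NAME)), C.CITY;  -- group-by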

Nesting of Logical Tables  Just as views in a SQL database can be nested (or stacked), so can logical tables. In other words, logical tables can be defined on top of others; see Figure 5. A logical table defined this way is sometimes referred to as a nested logical table. Logical tables can be nested indefinitely.

The biggest benefit of being able to nest logical tables is that it allows common specifications to be shared. For example, in Figure 5, two nested logical tables, LT1 and LT2, are defined on a third, LT3. The advantage of this layered approach is that all the specifications inside the mapping of LT3 are shared by the other two. If LT3's common specifications are changed, they automatically apply to LT1 and LT2 as well. This can relate to cleansing, transformation, and integration specifications. So, when an integration solution for two data sources has been defined, all other logical tables and applications can reuse it.
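Continuing the hypothetical example above, the sketch below shows how nesting enables sharing: two logical tables are defined on top of a shared one, so the join and cleansing logic is specified only once. Again, all names are illustrative and the exact syntax is product-specific.

    CREATE VIEW V_TOP_CUSTOMERS AS            -- comparable to LT1
    SELECT CUSTOMER_ID, CUSTOMER_NAME, TOTAL_ORDER_AMOUNT
    FROM   V_CUSTOMER_SALES                   -- comparable to LT3: the shared logical table
    WHERE  TOTAL_ORDER_AMOUNT > 100000;

    CREATE VIEW V_SALES_PER_CITY AS           -- comparable to LT2
    SELECT CITY, SUM(TOTAL_ORDER_AMOUNT) AS CITY_TOTAL
    FROM   V_CUSTOMER_SALES                   -- the same shared logical table
    GROUP BY CITY;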

Figure 5  Logical tables can be nested so that common specifications can be shared: two nested logical tables, LT1 and LT2, are defined on top of a third logical table, LT3, which in turn is defined on the physical tables.

Publishing Logical Tables When logical tables have been defined, they need to be published. Publishing
means that the logical tables are made available for applications through one or more languages and APIs.
For example, one application wants to access a logical table using the language SQL and through the API
JDBC, whereas another prefers to access the same logical table as a web service using SOAP and HTTP.
Most data virtualization servers support a wide range of interfaces and languages. Here is a list of some of the more popular ones:

SQL with ODBC
SQL with JDBC
SQL with ADO.NET
SOAP/XML via HTTP
REST (Representational State Transfer) with JSON (JavaScript Object Notation)
REST with XML
REST with HTML
Note that when multiple technical interfaces are defined on one logical table, the mapping definitions are
reused. So, if a mapping is changed, all the applications, regardless of the technical interface they use, will
notice the difference.

Data Security Via Logical Tables  Some source systems have their own data security system in place. They support their own features to protect against incorrect or illegal use of the data. When a data virtualization server accesses such sources, these security rules still apply, because the data virtualization server is treated as a user of that data source.

But not all data sources have a data security layer in place. In that case, data security can be implemented in the data virtualization server. For each table, privileges can be granted in very much the same way as access to tables in a SQL database is granted. Privileges such as select, insert, update, or delete can be granted. Note that data security can also be implemented when the data source supports its own data security mechanism.

Some data virtualization servers allow access privileges to be granted on table level, on individual column level, on record level, and even on individual value level. The last situation means that two users can have access to one particular record, where one user sees all the values and the other user sees a value being masked.
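A minimal sketch of what granting such privileges could look like, assuming the SQL-style GRANT syntax that many data virtualization servers borrow from database servers, and reusing the hypothetical logical tables from the earlier examples (V_CUSTOMER_MASTER and the role names are hypothetical as well):

    GRANT SELECT ON V_CUSTOMER_SALES TO SALES_ANALYSTS;          -- read-only access to a logical table
    GRANT SELECT ON V_TOP_CUSTOMERS  TO ACCOUNT_MANAGERS;        -- access to a nested logical table only
    GRANT SELECT, UPDATE ON V_CUSTOMER_MASTER TO DATA_STEWARDS;  -- an updatable logical table
    -- Column-level, record-level, and value-level restrictions (masking) are configured with
    -- product-specific mechanisms on top of these table-level privileges.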

Caching of Logical Tables  As indicated, data virtualization servers support on-demand data integration. Because doing the integration live is not always preferred, they support caching. For each logical table a cache can be defined. The effect is that the virtual content is materialized: the content is determined by running the query and the result is stored in a cache. From then on, when an application accesses a cached logical table, the data source is not accessed, but the answer is determined by retrieving data from the cache.

The reasons why caching is used are diverse:

Query performance
Load optimization
Consistent reporting
Source availability
Complex transformations

Regardless of whether caches are kept in memory, in a file, or in a database, they are managed by the data virtualization server itself. For each cached logical table a refresh schedule must be defined.
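As a sketch of how defining a cache can be as simple as an option on the logical table definition, the fragment below uses Teiid-style view options; the exact syntax, and the way refresh schedules are specified, differ per data virtualization product, so treat this purely as an illustration.

    CREATE VIEW V_CUSTOMER_SALES_CACHED
      OPTIONS (MATERIALIZED 'TRUE')           -- the virtual content is materialized into a cache
    AS
    SELECT CUSTOMER_ID, CUSTOMER_NAME, TOTAL_ORDER_AMOUNT
    FROM   V_CUSTOMER_SALES;                  -- hypothetical logical table from the earlier examples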

Query Optimization  When accessing data sources, performance is crucial. Therefore, it's important that data virtualization servers know how to access the sources as efficiently as possible; they must support an intelligent query optimizer.

One of the most important query optimization features is called pushdown. With pushdown, the data virtualization server tries to push as much of the query processing as possible to the data sources themselves. So, when a query is received, it analyzes it and determines whether it can push the entire query to the data source or whether only parts can be pushed down. In case of the former, the result coming back from the source needs no extra processing by the data virtualization server and can be passed straight on to the application. In case of the latter, the data virtualization server must do some extra processing before the result received from the source can be returned to the application.

Pushdown is needed to minimize the amount of data transmitted back to the data virtualization server, to let the database server do as little I/O as possible, and to let the data virtualization server itself do as little processing as possible. All this is to improve query performance.
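To make pushdown concrete, consider the following hypothetical query on the logical table sketched earlier, assuming that CUSTOMERS and ORDERS are wrapped from the same source database. A data virtualization server would typically try to push the filter, the join, and the aggregation down to that source, so that only a small, aggregated result travels back.

    -- Query received by the data virtualization server:
    SELECT CUSTOMER_NAME, TOTAL_ORDER_AMOUNT
    FROM   V_CUSTOMER_SALES
    WHERE  CITY = 'Boston';

    -- Fragment that could be pushed down to the source database (sketch):
    SELECT C.CUSTOMER_ID, UPPER(TRIM(C.CUSTOMER_NAME)), C.CITY,
           SUM(O.ORDER_AMOUNT), COUNT(*)
    FROM   CUSTOMERS AS C JOIN ORDERS AS O ON O.CUSTOMER_ID = C.CUSTOMER_ID
    WHERE  O.ORDER_STATUS <> 'CANCELLED' AND C.CITY = 'Boston'
    GROUP BY C.CUSTOMER_ID, UPPER(TRIM(C.CUSTOMER_NAME)), C.CITY;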


6 BI Application Areas for Data Virtualization


Currently, data virtualization is used in many different ways in BI systems. This section describes some of
the more popular application areas.

Virtual Data Mart A data mart can be developed for many different reasons. One is to organize the tables
and columns in such a way that it becomes easy for reporting tools and users to understand and query the
data. So, the data marts are designed for a specific set of reports and users. In classic BI systems, data
marts are physical databases. The drawback of developing a physical data mart is, first, that the data mart
database has to be designed, developed, optimized, and managed, and second, that ETL processes have to
be designed, developed, optimized, managed, and scheduled to load the data mart.
With data virtualization, a data mart can be simulated using logical tables. The difference is that the tables the users see are logical tables; they are not physically stored. Their content is derived on-demand when the logical tables are queried. Hence the name virtual data mart. Users won't see the difference. The big advantage of virtual data marts is agility: virtual data marts can be developed and adapted more quickly.
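As a sketch, and again using hypothetical names, a virtual data mart is nothing more than a set of logical tables shaped like the dimension and fact tables a reporting tool expects:

    CREATE VIEW DM_DIM_CUSTOMER AS            -- dimension table of the virtual data mart
    SELECT CUSTOMER_ID, UPPER(TRIM(CUSTOMER_NAME)) AS CUSTOMER_NAME, CITY
    FROM   CUSTOMERS;                         -- physical table

    CREATE VIEW DM_FACT_SALES AS              -- fact table of the virtual data mart
    SELECT CUSTOMER_ID, ORDER_DATE, ORDER_AMOUNT
    FROM   ORDERS                             -- physical table
    WHERE  ORDER_STATUS <> 'CANCELLED';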

Extended Data Warehouse  Not all data needed for analysis is available in the data warehouse. Non-traditional data sources, such as external data sources, call center log files, weblog files, voice transcripts from customer calls, and personal spreadsheets, are often not included. This is unfortunate, because including them can definitely enhance the analytical and reporting capabilities.

Enhancing a data warehouse with some of these data sources can be a laborious and time-consuming effort. For some data sources, it can take months before the data is incorporated in the chain of databases. In the meantime, business users can in no way get an integrated view of all that data, let alone invoke advanced forms of analysis.

With data virtualization servers, these sources together with the data warehouse will look like one integrated database. This concept is called an extended data warehouse. In a way, it feels as if data virtualization was designed for this purpose. In literature, this concept is comparable to the logical data warehouse and the data delivery platform [7].

Big Data Analytics  More and more organizations have big data stored in Hadoop and NoSQL systems. Unfortunately, most reporting and analytical tools aren't able to access those database servers, because most of them require a SQL or comparable interface. There are two ways to solve this problem. First, relevant big data can be copied to a SQL database. However, in situations in which a Hadoop or NoSQL solution is selected, the amount of data is probably massive. Copying all that data can be time-consuming, and storing all that data twice can be costly.

The second solution is to use a data virtualization server on top of a Hadoop or NoSQL system, to wrap it as a physical table, and to publish it with a SQL interface. The responsibility of the data virtualization server is to translate the incoming SQL statements to the API or language of the big data system. Because the interfaces of the NoSQL systems are proprietary, data virtualization servers must support dedicated wrapper technology for each of them.
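A sketch of what this enables for a reporting tool, assuming a hypothetical physical table WEBLOG_CLICKS that wraps data stored in Hadoop and the familiar CUSTOMERS table: to the tool, both are just SQL tables that can be joined.

    SELECT C.CUSTOMER_NAME, COUNT(*) AS PAGE_VIEWS
    FROM   WEBLOG_CLICKS AS W                 -- physical table wrapping big data stored in Hadoop
           JOIN CUSTOMERS AS C                -- physical table wrapping a classic SQL table
             ON C.CUSTOMER_ID = W.CUSTOMER_ID
    GROUP BY C.CUSTOMER_NAME;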

Operational Data Warehouse An operational data warehouse is normally described as a data warehouse that
not only holds historical data, but also operational data. It allows users to run reports on data that has
been entered a few seconds ago.
Implementing an operational data warehouse by rapidly copying new production data to the data warehouse can be a technological challenge. By deploying a data virtualization server, an operational data warehouse can be simulated without the need to copy the data.
Data virtualization servers can be connected to all types of databases, including production databases. So, if reports need access to operational data, logical tables can be defined that point to tables in a production database. This allows an application to elegantly join operational data (in the production database) with historical data (in the data warehouse). This makes it possible to develop an operational data warehouse without having to actually build a data warehouse that stores operational data itself.
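A sketch of the idea, with hypothetical table names: a logical table unions historical order data from the warehouse with today's orders read live from the production database, assuming the warehouse is loaded nightly.

    CREATE VIEW V_ALL_ORDERS AS
    SELECT ORDER_ID, CUSTOMER_ID, ORDER_DATE, ORDER_AMOUNT
    FROM   DWH_ORDERS                         -- physical table in the data warehouse (historical data)
    UNION ALL
    SELECT ORDER_ID, CUSTOMER_ID, ORDER_DATE, ORDER_AMOUNT
    FROM   PROD_ORDERS                        -- physical table in the production database (operational data)
    WHERE  ORDER_DATE = CURRENT_DATE;         -- only the rows not yet loaded into the warehouse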
Normally, to minimize interference with production systems, ETL jobs are scheduled during so-called batch windows. However, many production systems operate 24x7, thus removing the batch window. The workload
generated by data virtualization servers is more in line with this 24x7 constraint, because the query
workload is spread out over the day. In addition, they support various features, such as caching and
pushdown optimization, to access data sources as efficiently as possible.

Offloading Cold Data Data stored in a data warehouse can be classified as cold, warm, or hot. Hot data is
used almost every day, and cold data occasionally. Keeping cold data in a data warehouse slows down the
majority of the queries and is expensive, because all the data is stored in an expensive data storage
system. If the data warehouse is implemented with a SQL database, it may be useful to store cold data
outside that database and in, for example, a Hadoop system. This solution saves storage costs, but more
importantly, it can speed up queries on the hot and warm data in the warehouse (less data), plus it can
handle larger data volumes.
But most importantly, a data virtualization server is very useful when data from the SQL part of the data warehouse must be pumped to the Hadoop system. It makes copying of the data straightforward, because it comes down to a simple copy of the contents of one SQL table to another. And, with a data virtualization server on the Hadoop big data files, reports can still access the cold data easily.
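A sketch with hypothetical names: because both the warehouse table and the Hadoop files are exposed as tables by the data virtualization server, offloading cold data and keeping it queryable comes down to plain SQL.

    -- Pump cold data (for example, orders older than three years) to the table wrapping the Hadoop files;
    -- the offloaded rows can afterwards be removed from the warehouse table:
    INSERT INTO HADOOP_ORDERS_COLD
    SELECT * FROM DWH_ORDERS
    WHERE  ORDER_DATE < CURRENT_DATE - INTERVAL '3' YEAR;

    -- A logical table that gives reports transparent access to hot and cold data combined:
    CREATE VIEW V_ORDERS_COMPLETE AS
    SELECT * FROM DWH_ORDERS
    UNION ALL
    SELECT * FROM HADOOP_ORDERS_COLD;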

Cloud Transparency As indicated in Section 2, components of a data warehouse architecture are being
moved to the cloud. Such a migration can lead to changes in how data is accessed. A data virtualization
server can hide this migration. By always making sure that all the data is accessed via a data virtualization
server, the latter can hide where the data is physically stored. If a database is moved to the cloud, moved
back to on-premises, or migrated from one cloud to another, the data virtualization server can hide the
technical differences, thus making such a migration painless and transparent for the reports and users. In
this way, a data virtualization server implements cloud transparency.


Summary Data virtualization technology can be used for a long list of application areas. This section
describes a few, but more can be identified, such as sandboxing for data scientists, data services, and
prototyping of integration solutions.

7 Data Virtualization and the New BI Challenges


This section summarizes how data virtualization can meet the new BI challenges listed in Section 2:
Productivity Improvement: Because data virtualization supports on-demand data integration, there is less need to develop derived databases, such as data marts. This clearly shortens the chain of databases and results in systems that are quicker to develop and easier to maintain.
Self-Service BI: When virtual data marts have been developed, change requests are easy for the IT department to implement. It's just a matter of changing the definitions of the logical tables. No physical data marts have to be unloaded and reloaded, and no ETL jobs have to be changed. More on this in Section 8.
Operational Intelligence: As the previous section describes, the operational data warehouse is an
application area for data virtualization. Users can be given access to operational data (through
logical tables) without the need to copy that data. In addition, they can integrate the operational
data with historical data stored in a physical data warehouse.
Big Data, Hadoop, and NoSQL: Most data virtualization servers support direct access to Hadoop and NoSQL systems. This implies that data can stay where it is and can still be analyzed with reporting and analytical tools using SQL. Big data is not pulled through the chain.
Systems in the Cloud: Data virtualization can hide cloud technology. It can hide the location
where systems are running. Even migrating systems from one cloud to another can be done
transparently. Data virtualization makes the cloud transparent.
Data in the Cloud: If an external data source has a well-defined API, data virtualization servers
can access it. With a data virtualization server, reports can run directly on these external data
sources, and external data can be integrated, transformed, and cleansed in the same way as
internal enterprise data.

8 Data Virtualization Simplifies Sharing of Integration Specifications


The Dangers of Not Sharing Specifications  Rarely do all the users of an organization use the same reporting tool. Usually a wide range of tools is in use. Unfortunately, tools don't share integration, transformation, or cleansing specifications; see Figure 6. A solution developed for one tool cannot be re-used by another tool. (Note that in numerous BI systems, even if users are using the same tool, specifications are not shared either.) This means that the solution has to be implemented in many different tools, leading to a replication of all integration, transformation, and cleansing specifications.

Figure 6  BI tools don't share integration, transformation, and cleansing specifications; each tool has its own repository with its own integration solution on top of the data sources.

For example, a user can define the concept of a good customer based on the total number of orders he
has placed, the average value of orders, the number of products returned, the age of the orders, and so
on. He can enter filters and formulas to distinguish the good ones from the bad ones. If another user
needs a similar concept but uses another tool, he has to define it himself using his own tool. The
consequence is that the wheel is reinvented over and over again.
It must be clear that not sharing specifications lowers the agility level of a BI system, reduces the productivity of BI developers, and lowers the correctness and consistency of report results.

Self-Service BI Tools Have No Central Repository  A drawback of many self-service BI tools is that there is no real central repository where specifications are stored and can be shared by users and reports. For example, if two users want to integrate the same two data sources, they have to define their own solutions. So, despite the fact that these users are working with the same tool, there is limited sharing of integration and transformation specifications. They are reinventing the wheel over and over again.

Plus, integrating data sources is not always easy. Some data sources can have highly complex structures that require in-depth knowledge of how they organize data. Some of that logic may even be deeply hidden in the data structures and the data itself. So the question is whether, in such a situation, the correct form of integration is implemented. Integrating systems is not always as easy as drag and drop. The complexity of integration should never be trivialized.

Data Virtualization to the Rescue  By deploying data virtualization, many specifications can be implemented centrally and can be shared. They can even be shared by different tools from different vendors; see Figure 7. The definition of what a good customer is has to be entered only once in the data virtualization server and can then be shared by all users. It can even be shared across all BI tools in use.

It must be clear that this sharing of specifications raises the agility level of a BI system. If the definition of a good customer changes, it only has to be changed in one spot. It also increases the productivity of BI developers, and improves the correctness and consistency of report results. In addition, if the integration of some data sources is complex, it can be implemented by IT specialists (using logical tables) for all BI users. So, no reinvention of the wheel, but sharing and reusing of specifications.
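As a sketch of such a shared specification, with hypothetical names and thresholds: the business rule for a good customer is captured once, in a logical table, and every tool that queries it applies the same definition.

    CREATE VIEW V_GOOD_CUSTOMERS AS
    SELECT CUSTOMER_ID, CUSTOMER_NAME, TOTAL_ORDER_AMOUNT, NUMBER_OF_ORDERS
    FROM   V_CUSTOMER_SALES                   -- shared logical table from the earlier examples
    WHERE  NUMBER_OF_ORDERS   >= 10           -- business rule: placed at least 10 orders
      AND  TOTAL_ORDER_AMOUNT >= 25000;       -- business rule: total order value of at least 25,000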

Figure 7  With data virtualization, specifications are stored in one central repository and can be shared by all BI tools accessing the data sources.

With respect to self-service BI tools, if users prefer to define integration specifications themselves, they can still do that by accessing the low-level logical tables. Instead of defining concepts, such as good customer, with their own tool, they can create re-usable specifications themselves in the data virtualization server. They will experience the same level of ease of use and flexibility they are used to with their self-service tools. Changing specifications in the data virtualization server is as easy as changing comparable specifications in a self-service BI tool.

9 Overview of Red Hat's JBoss Data Virtualization Server

History of Red Hat JBoss Data Virtualization  Red Hat's product for data virtualization, called JBoss Data Virtualization (JDV), is not a brand new product, but a mature one. It started its life as a closed source product called MetaMatrix. The vendor was founded in 1998 as Quadrian and was later renamed MetaMatrix. They released their first data virtualization product in 1999. Before the product received much attention in the market and before data virtualization became popular, they were acquired by Red Hat in June 2007. Red Hat took a few years to transform the closed source product into an open source one.

Initially, Red Hat's product was released under the name JBoss Enterprise Data Services Platform. This name was changed at the end of 2013. Currently, there are two versions of the product available: Teiid [5] is the community edition, and JBoss Data Virtualization is the enterprise edition. Noteworthy is that Red Hat's product is the only open source data virtualization server currently available. This section focuses on JDV only.

Collaborative and Rapid Development  Each data virtualization product offers on-demand data viewing. When a logical table is created, users can study its (virtual) contents right away. JDV also supports on-demand data visualization through dashboarding. With this feature, the virtual contents of logical tables can be visualized as bar charts, pie charts, and so on, right after the table has been created; see Figure 8.

[5] Teiid is a type of lizard. In addition, the name contains the acronym EII, which stands for Enterprise Information Integration. The term EII can be seen as a forerunner for data virtualization.

Figure 8  A screenshot of JBoss Data Virtualization showing data visualization through dashboards.

This on-demand data visualization feature allows for collaborative development. Analysts and business users can sit together and work on the definitions of logical tables. The analysts will work on the definitions and the models (which may be too technical for some users) and the users will see an intuitive visualization of the data: their data. Because of this collaborative development, less costly development time will be lost on incorrect implementations. It leads to rapid development.

Design Environment  Logical tables can be defined using a graphical and easy-to-use design environment. Figure 9 contains a screenshot showing five logical tables and their relationships. The familiar JBoss Developer Studio is available as an add-on and can be used as the design and development environment.

Lineage and Impact Analysis JDV stores all the definitions of concepts, such as data sources, logical tables,
and physical tables, in one central repository. This makes it easy for JDV to show all the dependencies
between these objects. This feature helps, for example, to determine what the effect will be if the
structure of a source table or logical table changes; which other logical tables must be changed as well? In
other words, JDV offers lineage and impact analysis.

Accessing Data Sources  JDV can access a long list of source systems, including most well-known SQL database servers (Oracle, DB2, SQL Server, MySQL, and PostgreSQL), enterprise data warehouse platforms (Teradata, Netezza, and EMC/Greenplum), office tools (Excel, Access, and Google Spreadsheets), applications (SAP and Salesforce.com), flat files, XML files, SOAP and REST web services, and OData services.


Figure 9  A screenshot of JBoss Data Virtualization showing the relationships between tables.

Accessing Big Data: Hadoop and NoSQL  For accessing big data stored in Hadoop or NoSQL systems, JDV comes with interfaces for HDFS and MongoDB. With respect to Hadoop, JDV's implementation works via Hive. So, JDV sends the SQL query coming from the application to Hive, and Hive translates that query into MapReduce code, which is then executed in parallel on HDFS files. In the case of MongoDB, hierarchical data is flattened to a relational table.

User-Defined Functions For logic too complex for SQL, developers can build their own functions. Examples
may be complex statistical functions or functions to turn a complex value into a set of simple values.
These user-defined functions can be developed in Java and can be invoked from SQL statements. Invoking
the UDFs is comparable to invoking built-in SQL functions.
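For example, once a hypothetical Java function named PARSE_PRODUCT_CODE has been registered as a user-defined function, invoking it in a mapping is no different from calling a built-in function; the function and table names below are purely illustrative.

    CREATE VIEW V_ORDER_LINES_ENRICHED AS
    SELECT ORDER_ID,
           PARSE_PRODUCT_CODE(PRODUCT_CODE) AS PRODUCT_GROUP,  -- hypothetical Java UDF invoked from SQL
           ORDER_AMOUNT
    FROM   ORDER_LINES;                                        -- hypothetical physical table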

Two Languages For Developing Logical Tables Logical tables can be defined using SQL queries or stored
procedures. The SQL dialect supported is rich enough to specify most of the required transformations,
aggregations, and integrations. Stored procedures may be needed, for example, when non-SQL source
systems are accessed that require complex transformations to turn non-relational data into more
relational structures.

Query Optimizer To improve query performance, JDVs query optimizer supports various techniques to
push down most or all of the query processing to the data sources. It supports several join processing
strategies, such as merge joins and distributed joins. The processing strategy (or query plan) selected by
the optimizer can be studied by the developers. The optimizer is not a black box. This openness of the
optimizer is very useful for tuning and optimizing queries.


Caching  JDV supports two forms of caches: internal materialized caches and external materialized caches. With internal materialized caches, the cache is kept in memory. The advantage of using internal materialized caches is fast access to the data, but the disadvantage is that memory is limited: not all data can be cached. With external materialized caches, data is stored in a SQL database. In this case, there is no restriction on the size of the cached data, but it will be somewhat slower than the in-memory alternative.

Publishing Logical Tables  JDV supports a long list of APIs through which logical tables can be accessed, including JDBC, ODBC, SOAP, REST, and OData.

Data Security  JDV supports the four different forms of data access security described in Section 5. Privileges such as select, insert, update, and delete can be granted on the table level, individual column level, record level, and even on the individual value level. This makes it possible to let JDV operate as a data security firewall. In addition, when logical tables are published using particular APIs, security aspects can be defined as well. For example, developers can publish a logical table using SOAP extended with WS-Security.

Embeddable  All the functionality of JDV can be invoked through an open API. This means that applications can invoke JDV's functionality directly, making JDV an embeddable data virtualization server. Vendors and organizations can use this API to develop embedded solutions.

Summary  JBoss Data Virtualization is a mature data virtualization server that allows organizations to develop BI systems with more agile architectures. Its on-demand data integration capabilities make it ready for many application areas:
Virtual data marts
Extended data warehouse
Big data analytics
Operational data warehouse
Offloading cold data
Cloud transparency
JDV allows the development of agile BI systems that are ready for the new challenges that BI systems face today.


About the Author Rick F. van der Lans

Rick F. van der Lans is an independent analyst, consultant, author, and lecturer specializing in data warehousing, business intelligence, database technology, and data virtualization. He works for R20/Consultancy (www.r20.nl), a consultancy company he founded in 1987.

Rick is chairman of the annual European Enterprise Data and Business Intelligence Conference (organized in London). He writes for the eminent B-eye-Network.com [6] and other websites. In 2009, in a number of articles [7] all published at BeyeNetwork.com, he introduced the business intelligence architecture called the Data Delivery Platform, which is based on data virtualization.

He has written several books. His latest book [8], Data Virtualization for Business Intelligence Systems, was published in 2012. Published in 1987, his popular Introduction to SQL [9] was the first English book on the market devoted entirely to SQL. After more than twenty-five years, this book is still being sold, and has been translated into several languages, including Chinese, German, Italian, and Dutch.

For more information please visit www.r20.nl, or email rick@r20.nl. You can also get in touch with him via LinkedIn and via Twitter @Rick_vanderlans.

About Red Hat, Inc.

Red Hat is the world's leading provider of open source solutions, using a community-powered approach to provide reliable and high-performing cloud, virtualization, storage, Linux, and middleware technologies. Red Hat also offers award-winning support, training, and consulting services. Red Hat is an S&P 500 company with more than 70 offices spanning the globe, empowering its customers' businesses.

[6] See http://www.b-eye-network.com/channels/5087/articles/
[7] See http://www.b-eye-network.com/channels/5087/view/12495
[8] R.F. van der Lans, Data Virtualization for Business Intelligence Systems, Morgan Kaufmann Publishers, 2012.
[9] R.F. van der Lans, Introduction to SQL; Mastering the Relational Database Language, fourth edition, Addison-Wesley, 2007.
