Big Data Analytics: Profiling the Use of Analytical Platforms in User Organizations
By Wayne Eckerson
Director of Research, Business Applications and Architecture Group, TechTarget, September 2011
EXECUTIVE SUMMARY
RESEARCH BACKGROUND
BIG DATA ANALYTICS: DERIVING VALUE FROM BIG DATA
ARCHITECTURE FOR BIG DATA ANALYTICS
PLATFORMS FOR RUNNING BIG DATA ANALYTICS
RECOMMENDATIONS

EXECUTIVE SUMMARY
Companies have been storing and analyzing large volumes of data since the advent of the data warehousing movement in the early 1990s.
Analytical platforms. To keep pace with the desire to store and analyze ever larger volumes of structured data, relational database vendors have delivered specialized analytical platforms that provide dramatically higher levels of price-performance compared with general-purpose relational database management systems (RDBMSs). These analytical platforms come in many shapes and sizes, from software-only databases and analytical appliances to analytical services that run in a third-party hosted environment. Almost three-quarters (72%) of our survey respondents said they have implemented an analytical platform that fits this description.
In addition, new technologies have emerged to address exploding volumes of complex structured data, including Web traffic, social media content and machine-generated data, such as sensor and Global Positioning System (GPS) data. New nonrelational database vendors combine text indexing and natural language processing techniques with traditional database technology to optimize ad hoc queries against semi-structured data. And many Internet and media companies use new open source frameworks such as Hadoop and MapReduce to store and process large volumes of structured and unstructured data in batch jobs that run on clusters of commodity servers.
Business users. In the midst of these platform innovations, business users await tools geared to their information requirements. Casual users (executives, managers, front-line workers) primarily use reports and dashboards that deliver answers to predefined questions. Power users (business analysts, analytical modelers and data scientists) perform ad hoc queries against a variety of sources. Most business intelligence (BI) environments have done a poor job meeting these diverse needs within a single, unified architecture. But this is changing.
RESEARCH BACKGROUND
THE PURPOSE OF
[Figure: respondent profile chart with the categories Analyst, Administrator, Developer and Other; bar values not recoverable.]
[Figure: respondents by region: North America 66.7%, Europe 16.5%, Other 16.9%. Two further values, 24.8% and 22.8%, have lost their labels.]
[Figure: respondent profile chart with the categories Retail, Consulting, Banking, Insurance, Computers, Telecommunications, Software, Manufacturing, Health Care Payor, Hospitality/Travel and Other; bar values not recoverable.]
There has been a lot of talk about big data in the past year, which I find a bit puzzling. I've been in the data warehousing field for more than 15 years, and data warehousing has always been about big data.
Back in the late 1990s, I attended a ceremony honoring the Terabyte Club, a handful of companies that were storing more than a terabyte of raw data in their data warehouses. Fast-forward more than 10 years and I could now be attending a ceremony for the Petabyte Club. The trajectory of data acquisition and storage for reporting and analytical applications has been steadily expanding for the past 15 years.
So what's new in 2011? Why are we talking about big data today? There are several reasons:
The growth in data is
[Figure: chart with a yearly axis running from 2005 through 2012; values not recoverable.]
SOURCE: IDC Digital Universe 2009: White Paper, sponsored by EMC, 2009.
automobile valuation company Kelley Blue Book is now collecting and storing Web traffic data in-house so it can combine that information with sales and other corporate data to better understand customer behavior, according to Dan Ingle, vice president of analytical insights and technology at the company. At the same time, virtualization technology is beginning to make it attractive for organizations to consider moving large-scale data processing outside their data center walls to private hosted networks or public clouds.
We are the beginning of an amazing world of
Big vs. small data. A valuable characteristic of big data is that it contains more patterns and interesting anomalies than small data. Thus, organizations can gain greater value by mining large data volumes than small ones. While users can detect the patterns in small data sets with simple statistical methods, with ad hoc query and analysis tools or by eyeballing the data, they need sophisticated techniques to mine big data. Fortunately, these techniques and tools already exist thanks to companies such as SAS Institute and SPSS (now part of IBM) that ship analytical workbenches (i.e., data mining tools). These tools incorporate all kinds of analytical algorithms that have been developed and refined by academic and commercial researchers over the past 40 years.
However, the road to big data analytics is not easy and success is not guaranteed. Analytical champions are still rare. That's because succeeding with big data analytics requires the right culture, people, organization, architecture and technology (see Figure 6).
Top down. For the past 15 years, BI teams have built data warehouses that serve the information needs of casual users (e.g., executives, managers and front-line staff). These top-down, report-driven environments require developers to know in advance what kinds of questions casual users want to ask and which metrics they want to monitor. With requirements in hand, developers create a data warehouse model, build extract, transform and load (ETL) routines to move data from source systems to the data warehouse, and then create reports and dashboards to query the data warehouse (see Figure 7, page 14).
Whether by choice or not, power users who operate in an exclusively top-down BI environment are largely left to fend for themselves, using spreadsheets, desktop databases, SQL and data-mining workbenches. Business analysts generally find BI tools too inflexible and data warehousing data too limited. At best, they might use BI tools as glorified extract engines to dump data into Microsoft Excel, Access or some other analytical environment. The upshot is that these analysts and data scientists generally spend an inordinate amount of time preparing data instead of analyzing it, and they create hundreds if not thousands of data silos that wreak havoc on information consistency from a corporate perspective.
answer in advance because they are usually responding to emergency requests from executives and managers who need information to address new
and unanticipated events in the marketplace. Rather than focus on goals and
metrics, business analysts spend most of their time engaged in ad hoc projects,
or they work closely with business managers to optimize existing processes.
As you can see, there is a world of difference between a top-down and
bottom-up BI environment. Many organizations have tried to support both
types of processing within a single BI environment. But that no longer works
in the age of big data analytics. Forward-thinking companies are expanding
their data warehousing architectures and data governance programs to better
balance the dynamic between top-down and bottom-up requirements. (See
Analytic Architectures: Approaches to Supporting Analytics Users and Workloads,
a 40-page report by Wayne Eckerson, available for free download.)
NEXT-GENERATION BI ARCHITECTURE
Figure 8 represents the next-generation BI architecture, which blends elements from top-down and bottom-up BI into a single cohesive environment that adequately supports both casual and power users. The top half of the diagram represents the classic top-down, data warehousing architecture that primarily delivers interactive reports and dashboards to casual users (although the streaming/complex event processing (CEP) engine is new). The bottom half of the diagram adds new architectural elements and data sources that better accommodate the needs of business analysts and data scientists and make them full-fledged members of the corporate data environment.
SERVER ENVIRONMENT
Hadoop. The biggest change in the new BI architecture is that the data warehouse is no longer the centerpiece. It now shares the spotlight with systems that manage structured and unstructured data. The most popular among these is Hadoop, an open source software framework for building data-intensive applications. Following the example of Internet pioneers, such as Google, Amazon and Yahoo, many companies now use Hadoop to store, manage and process large volumes of Web data.
Hadoop runs on the Hadoop Distributed File System (HDFS), a distributed file system that scales out on commodity servers. Since Hadoop is file-based, developers don't need to create a data model to store or process data, which makes Hadoop ideal for managing semi-structured Web data, which comes in many shapes and sizes. But because it is schema-less, Hadoop can be used to store and process any kind of data, including structured transactional data and unstructured audio and video data. The biggest advantage of Hadoop right now is that it's open source, which means that the up-front costs of implementing a system to process large volumes of data are lower than for commercial systems. However, Hadoop does require companies to purchase and manage dozens, if not hundreds, of servers and train developers and administrators to use this new technology.
[Figure: comparison of Hadoop and non-Hadoop responses on a 0 to 100 scale; values not recoverable.]
SOURCE: Hadoop and Information Management: Benchmarking the Challenge of Enormous Volumes of Data: Executive Summary, Ventana Research, June 23, 2011.
• Support analytics. The big data crowd (i.e., Internet developers) views Hadoop primarily as an analytical engine for running analytical computations against large volumes of data. To query Hadoop, analysts currently need to write programs in Java or other languages and understand MapReduce, a framework for writing distributed (or parallel) applications. The advantage here is that analysts aren't restricted by SQL when formulating queries. SQL does not support many types of analytics, especially those that involve inter-row calculations, which are common in Web traffic analysis. The disadvantage is that Hadoop is batch-oriented and not conducive to iterative querying.
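To make the MapReduce programming model mentioned above more concrete, here is a minimal single-machine sketch in plain Python. It is not Hadoop's API: the log records, the function names (`map_phase`, `shuffle`, `reduce_phase`, `page_views`) and the page-view use case are all hypothetical, chosen only to show the three steps a MapReduce job goes through.

```python
from collections import defaultdict

# Hypothetical Web log records as (user_id, url) pairs; not data from the report.
LOG = [
    ("u1", "/home"),
    ("u2", "/home"),
    ("u1", "/pricing"),
    ("u3", "/home"),
]


def map_phase(records):
    """Map step: emit one (key, value) pair per input record."""
    for _user, url in records:
        yield url, 1


def shuffle(pairs):
    """Shuffle step: group all values by key before reducing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Reduce step: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def page_views(records):
    """Run the three steps in sequence, as one batch job would."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    print(page_views(LOG))
```

In a real cluster the map and reduce steps would run in parallel on many servers and the framework would handle the shuffle; the sketch only illustrates why an analyst has to think in terms of keys and values rather than SQL statements.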
Nonrelational databases. While Hadoop has received a lot of press attention lately, it's not the only game in town for storing and managing semi-structured data. In fact, an emerging and diverse set of products goes one step further than Hadoop and stores both structured and unstructured data within a single index. These so-called nonrelational databases (depicted in Figure 8 supporting a free-standing sandbox) typically extract entities from documents, files and other databases using natural language processing techniques and index them as key-value pairs for quick retrieval using a document-centric query language such as XQuery. As a result, these products can give users one place to go to query both structured and unstructured data.
This style of analysis, which some call unified information access, exhibits
many search-like characteristics. But instead of returning a list of links, the
systems return qualified data sets or reports in response to user queries. And
unlike Hadoop, the systems are interactive, allowing users to submit queries in
an iterative fashion so they can understand trends and issues.
These nonrelational systems complement Hadoop, an enterprise data warehouse or both. For example, organizations might use Hadoop to transcribe audio files and then load the transcriptions into a nonrelational database for analysis. Or they might replicate sales and customer data from a data warehouse and combine it with Web data in a nonrelational database so power users can find correlations between Web traffic and customer orders without bogging down performance of the data warehouse with complex queries. This type of unified information access is critical in a growing number of applications.
For example, an oil and gas company uses MarkLogic to track the location of ships at sea. The MarkLogic Server stores GPS data, news feeds, weather data and commodity prices, among other things, and surfaces all this data on a map that users can query. For example, a user might ask, "Show me all the ships within this polygon (i.e., geographic area) that are carrying this type of oil and have changed course since leaving the port of origin." The application then displays the results on the map.
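The kind of question in the ship example combines a geographic filter with ordinary attribute filters. The following sketch shows one way such a query could be evaluated in plain Python. It does not use MarkLogic or its query language; the `Ship` record, its fields and the `ships_matching` helper are hypothetical, and the geographic test is a standard ray-casting point-in-polygon check on flat coordinates.

```python
from dataclasses import dataclass


@dataclass
class Ship:
    # Hypothetical record layout, for illustration only.
    name: str
    lon: float
    lat: float
    cargo: str
    changed_course: bool


def point_in_polygon(x, y, polygon):
    """Ray casting: cast a ray to the right of (x, y) and count edge crossings."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def ships_matching(ships, polygon, cargo):
    """Names of ships inside the polygon that carry `cargo` and changed course."""
    return [
        s.name
        for s in ships
        if s.cargo == cargo
        and s.changed_course
        and point_in_polygon(s.lon, s.lat, polygon)
    ]


if __name__ == "__main__":
    area = [(0, 0), (10, 0), (10, 10), (0, 10)]
    fleet = [
        Ship("A", 5, 5, "crude", True),
        Ship("B", 15, 5, "crude", True),
        Ship("C", 2, 3, "crude", False),
        Ship("D", 8, 1, "diesel", True),
    ]
    print(ships_matching(fleet, area, "crude"))
```

A production system would evaluate the same predicate against an index rather than scanning every record, and would use proper geodesic calculations instead of treating longitude and latitude as flat coordinates.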
Data warehouse hubs. While Hadoop and nonrelational systems primarily
manage semi-structured and unstructured data, the data warehouse manages
structured data from run-the-business operational systems. Except for Teradata shops, many companies increasingly use data warehouses running on traditional relational databases as hubs to feed other systems and applications
rather than to host reporting and analysis applications.
For example, Dow Chemical, which maintains a large SAP Business Warehouse (BW) data warehouse, now runs all queries against virtual cubes that run in memory using SAP BW Accelerator. "By running our cubes in memory, we've de-bottlenecked our data warehouse," said Mike Masciandaro, director of business intelligence at Dow. "Now our data warehouse primarily manages batch loads to stage data." Likewise, Blue Cross Blue Shield of Kansas City has transformed its IBM DB2 data warehouse into a hub that feeds transaction and analytical systems and implemented a Teradata Data Warehouse Appliance 2650 to handle all reports and queries and support a self-service BI environment.
• In-memory BI sandbox. Some desktop BI tools, such as QlikView or PowerPivot, maintain a local data store in memory to support interactive dashboards or ad hoc queries. These sandboxes are popular among analysts because they generally let them pull data from any source, quickly link data sets, run super-fast queries against data held in memory, and visually interact with the results, all with little or no IT intervention. Also, some server-based environments, such as SAP HANA, store all data in memory, accelerating queries for all types of BI users.
enormous volumes of a single discrete event type, such as sensor data generated by a pipeline or medical device. Streaming engines typically ingest an order of magnitude more events per second than CEP engines but usually pull data from only a single source. However, streaming and CEP engines are merging in functionality as vendors seek to offer one-stop shopping for continuous intelligence capabilities.
CLIENT ENVIRONMENT
Power users. The biggest change in the new analytical BI architecture is how it accommodates the information needs of power users. It gives power users many new options for consuming corporate data rather than creating countless spreadmarts. A power user is a person whose job is to crunch data on a daily basis to generate insights and plans. Power users include business analysts (e.g., Excel jockeys), analytical modelers (e.g., SAS programmers and statisticians) and data scientists (e.g., application developers with business process and database expertise). Power users have five options as depicted in Figure 8:
• Query Hadoop. If power users want to analyze big data in its raw or lightly aggregated form, they can query Hadoop directly by writing MapReduce code in a variety of languages. However, power users must know how to write parallel queries and interrogate the structure of the data prior to querying it since Hadoop data is schema-less. Vendors are also beginning to ship BI tools that access Hadoop through Hive or HBase and return data sets to the BI tool.
Data integration. The new BI architecture also places a premium on managing and manipulating data flows between systems. This calls for a versatile set of data integration tools that can access any type of data (e.g., structured, semi-structured and unstructured), load it into any target (e.g., Hadoop, data warehouse or in-memory database), navigate data sources that exist on-premises or in the cloud, work in batch and real time, and handle both small and large volumes of data.
Data integration tools for Hadoop are in their infancy but evolving fast. The open source community has developed Flume, a scalable distributed service that collects, aggregates and loads data into HDFS. But longtime data integration vendors, such as Informatica, are also converting their visual design tools to interoperate with Hadoop. That way, ETL developers can use familiar tools to extract, load, parse, integrate, cleanse and match data in Hadoop by generating MapReduce code under the covers.
In this respect, Hadoop is both another data source for ETL tools and a new data processing engine geared to handling semi-structured and unstructured data. Data integration products that run on both relational and nonrelational platforms and maintain a consistent set of metadata across both environments will reduce overall training and maintenance costs.
Dollar General. That was the case with Dollar General, a discount retailer that wanted to purchase an analytical system to supplement its Oracle data warehouse, which could not store atomic-level point-of-sale (POS) data from its 9,500 stores nationwide. With a reference from a consumer products partner, Dollar General decided to implement an analytical platform from service provider 1010data. The product, which is accessed via a Web browser, offers an Excel-like interface that provides native support for time-series data and analytical functions. Within five weeks, Dollar General was running daily reports against atomic-level POS data, according to Sandy Steier, executive vice president and co-founder of 1010data.
A year later, Dollar General decided to replace its Oracle data warehouse and conducted a proof of concept with several leading analytical platform providers. 1010data, which participated in the bake-off, demonstrated superior performance and now runs Dollar General's entire data warehouse.
And while Dollar General didn't set out to purchase an analytical service, that proved a smart move. Besides quick deployment times and reduced internal maintenance costs, the analytical service made it easier for Dollar General to open up its data warehouse to suppliers, which now use it to track sales and make recommendations for product placement and promotions, Steier said.
As with most things, there are exceptions to this rule. For example, Oracle Exadata runs on Oracle 10g and, as such, it supports both transactional and analytical processing, often with superior performance in both realms compared with standard Oracle 10g installations.
TECHNOLOGY | DESCRIPTION | VENDOR/PRODUCT
Massively parallel processing analytic databases | |
Columnar databases | |
Analytical appliances | Preconfigured hardware-software systems designed for query processing and analytics that require little tuning. |
Analytical bundles | |
In-memory databases | |
Analytical services | | 1010data, Kognitio
Nonrelational | |
CEP/streaming engines | |
• MPP. Most analytical platforms spread data across multiple nodes, each containing its own CPU, memory and storage and connected to a high-speed backplane. When a user submits a query or runs an application, the shared-nothing system divides the work across the nodes, each of which processes the query on its piece of the data and ships the results to a master node that assembles the final result and sends it to the user. MPP systems are highly scalable, since you simply add nodes to increase processing power. And if the nodes run on commodity servers, as many MPP systems today do, then scaling is more cost-effective than with MPP systems running on proprietary hardware or with symmetric multiprocessing systems, which require big, expensive boxes to scale.
• Balanced configurations. Analytical platforms optimize the configuration of CPU, memory and disk for query processing rather than transaction processing. Analytical appliances essentially hard-wire this configuration into the system and don't let customers change it, whereas analytical bundles or analytical databases (i.e., software-only solutions) allow customers to configure the underlying hardware to match unique application requirements. Analytical appliances offer convenience and ease of use while analytical databases offer flexibility.
in memory, many analytical platforms are expanding their memory footprints to speed query processing.
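The MPP behavior described in the list above (divide the work across nodes, let each node process its own slice, then assemble the result on a master node) can be sketched in a few lines of Python. This is an illustration only, not any vendor's implementation: threads stand in for nodes, and the names `partition`, `node_query` and `master_query`, along with the sales-by-store query, are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_nodes):
    """Spread rows across nodes round-robin so each node owns one shard."""
    shards = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        shards[i % num_nodes].append(row)
    return shards


def node_query(shard):
    """Work done on one node: total sales per store, using only its own shard."""
    partial = Counter()
    for store, amount in shard:
        partial[store] += amount
    return partial


def master_query(rows, num_nodes=3):
    """Master node: scatter the query, gather partial results, merge them."""
    shards = partition(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(node_query, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)


if __name__ == "__main__":
    sales = [("s1", 10), ("s2", 5), ("s1", 7), ("s3", 2), ("s2", 1)]
    print(master_query(sales))
```

Because no node needs another node's data to compute its partial result, adding nodes in this model simply means creating more shards, which is the scalability property the text attributes to shared-nothing systems.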
Hadoop and NoSQL. Some may argue whether Hadoop and the nonrelational databases are analytical platforms. While they don't store data in rows and columns, both are well-suited to process large volumes of data for analytical purposes. Most use an MPP architecture that scales out on commodity servers. And some, such as MarkLogic, are full-fledged databases that support transactional integrity.
Hadoop in particular differs significantly from most analytical platforms. As a batch system, it's not focused on optimizing query performance like other analytical platforms, and thus does not implement many of the characteristics in the above bulleted list. However, Hadoop's biggest value is that it's open source and so can process large volumes of data in a cost-effective way. And like many nonrelational systems, it is schema-less, giving administrators greater flexibility to change data structures without having to spend weeks or months rewriting a data model.
Given this definition, almost three-quarters (72%) of our survey respondents said that they had purchased or implemented an analytical database. While the growth of the analytical platform market has been strong, this 72% figure seems a tad high, given that a majority of analytical database products have been on the market for less than five years. Upon closer investigation, despite our definition, a sizable number of respondents, when asked to name their analytical platform, identified a general-purpose database, in particular Microsoft SQL Server and Oracle (non-Exadata). Regardless, the data still shows that many companies are turning to specialized analytical platforms to better meet their analytical requirements.
Non-customers. Among respondents that haven't purchased an analytical platform, 46% have no plans to do so, 42% are exploring the idea and just 12% are currently evaluating vendors. On the whole, about 75% of respondents will have an analytical platform in the near future (see Figure 10, page 33).
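The report does not show how the "about 75%" figure is derived. One reading that is consistent with the numbers above, offered here as an assumption rather than the author's stated method, is to add the share of respondents who are currently evaluating vendors to the share who already have a platform. The function name and parameters below are hypothetical.

```python
def near_term_adoption(have_platform, evaluating_share_of_rest):
    """Current owners plus the non-owners who are evaluating vendors.

    Assumption: only respondents actively evaluating vendors are counted as
    near-term adopters; those merely exploring the idea are left out.
    """
    non_customers = 1 - have_platform
    return have_platform + non_customers * evaluating_share_of_rest


if __name__ == "__main__":
    # Survey figures quoted in the text: 72% have a platform,
    # 12% of the remaining 28% are evaluating vendors.
    share = near_term_adoption(0.72, 0.12)
    print(f"{share:.1%}")
```

Under this reading, 12% of the 28% without a platform is roughly 3.4 percentage points, which brings the total to a little over 75%.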
DEPLOYMENT OPTIONS
WHY BIG DATA?
Our survey grouped analytical platforms into four major categories to make it
easier to compare and contrast various product offerings:
1. Analytical databases: These are software-only analytical
platforms that run on a variety of customer-purchased hardware. Customers must install, configure and tune the software, including the analytical database,
before they can use the analytical system. Most MPP analytical databases,
columnar databases and in-memory databases listed in Table 1 qualify as analytical databases.
2. Analytical appliances: These are hardware-software combinations
designed to support ad hoc queries and other types of analytical processing.
Analytical appliances tightly integrate the hardware and software, often using
proprietary components, to optimize performance and minimize the need for
tuning. Analytical bundles, which consist of standalone hardware and software products that a vendor ships as a package, also qualify as analytical
appliances. Bundles give administrators more flexibility to tune the system but
sacrifice deployment speed and manageability.
3. Analytical services: Rather than deploy an analytical platform in a customer's data center, an analytical service enables customers to house the system in an off-site hosted environment or public cloud. This eliminates up-front
capital expenditures and lessens maintenance.
4. File-based analytical system: This generally refers to Hadoop, but we also
lumped NoSQL or nonrelational systems into this category, although it's not
entirely accurate, since nonrelational systems are databases. However, since
both are used to store and analyze large volumes of unstructured data and
don't require an up-front schema design, they share more similarities than differences.
Given these categories, most analytical platform customers have either purchased or implemented analytical databases (46%) or analytical appliances
(49%). Many fewer have implemented a file-based analytical system (10%)
or analytical service (5%). (See Figure 11.)
Looking under the covers, analytical database customers are most likely to
have purchased Microsoft SQL Server or Oracle, while appliance customers
have purchased Teradata Active DW, a Teradata Appliance, or Netezza. Analytical services customers subscribed to a host of different vendors, while customers of file-based analytical systems were most likely to purchase a Hadoop
distribution from Cloudera, Apache or EMC Greenplum.
DEPLOYMENT STATUS
Drilling into each category further, we find that most of the respondents who
have purchased an analytical platform of some type have also deployed the
system. Roughly three-quarters of customers with analytical databases (73%)
and slightly more customers of analytical appliances (80%) have deployed
their systems. Not surprisingly, 100% of analytical services customers have
deployed their systems, but only 33% of customers with file-based analytical
systems have implemented theirs (see Table 2, page 34).
Table 2
                          ANALYTICAL   ANALYTICAL   ANALYTICAL   FILE-BASED
                          DATABASE     APPLIANCE    SERVICE      ANALYTICAL SYSTEM
Percentage deployed       72%          81%          100%         33%
Average years deployed    4.0          4.9          3            1.3
With an analytical service, you simply create a data model (which in many
cases is optional) and load your data either by using the Internet or shipping a
disk to the provider, and the provider takes care of the rest. Thus, it's much
easier to deploy an analytical service than the other options, which accounts for
the 100% deployment figure in Table 2.
Analytical appliances generally take less time to deploy than analytical databases, which may account for the slightly higher deployment percentage. Analytical databases require customers to purchase and install hardware, which
may take many months and require multiple sign-offs from the IT, legal and
purchasing departments. Despite overwhelming press coverage of Hadoop,
few companies have implemented the system. Among those that have, most
are largely experimenting, which explains the low deployment percentage
compared with the other options.
The figures for average years deployed tell a similar story. As the new kid
on the block, Hadoop systems have only been deployed for an average of 1.3
years, followed by analytical services, which have been deployed an average
of three years. In contrast, analytical appliances have been deployed for 4.9
years and analytical databases for 4.0 years.
TECHNICAL DRIVERS
[Figure: bar chart of technical drivers by deployment option. Categories: Faster queries, Reduced costs, Higher availability, Quicker to deploy, Easier maintenance, Built-in analytics. Series: Analytical database, Analytical appliance, Analytical service, File-based analytical system.]
Figure 13: Were you explicitly looking for [this deployment option]?
(Percentages based on respondents who answered Yes)
[Bar chart. Series: Analytical database, Analytical appliance, Analytical service.]
and money.
"Agility and interoperability with existing technologies were key drivers for
the customer," said Rick Glick, vice president of customer and partner development at ParAccel.
BUSINESS APPLICATIONS
When push comes to shove, the value of an analytical platform is judged not
by its technical merits, but by the business applications it supports or makes
possible. The most popular business applications running on analytical platforms are customer analytics, followed by management reports, financial analytics, data integration, executive dashboards and risk analytics. This ranking is
based on summing the percentages of all four deployment options for each
application (see Figure 14).
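The ranking method described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name, the dictionary layout and all of the percentages below are hypothetical placeholders, not survey results.

```python
from typing import Dict, List, Tuple


def rank_by_summed_share(
    shares: Dict[str, Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Rank applications by the sum of their percentages across options."""
    totals = {app: sum(by_option.values()) for app, by_option in shares.items()}
    # Sort descending by total; ties are broken alphabetically.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# PLACEHOLDER percentages for illustration only; these are NOT survey data.
example = {
    "application A": {"database": 50, "appliance": 60, "service": 40, "file-based": 30},
    "application B": {"database": 70, "appliance": 20, "service": 30, "file-based": 10},
    "application C": {"database": 10, "appliance": 90, "service": 50, "file-based": 40},
}

ranking = rank_by_summed_share(example)
for app, total in ranking:
    print(app, total)
```

Note that a plain sum gives each deployment option equal weight regardless of how many respondents chose it, and the totals can exceed 100. They are useful for ordering applications, not as percentages in their own right.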
[Figure 14: bar chart of business applications by deployment option. Categories: Customer analytics, Management reports, Financial analytics, Data integration, Executive dashboards, Risk analytics, Logistics analytics, Cross-sell. Series: Analytical database, Analytical appliance, Analytical service, File-based analytical system.]
Figure 14 also exposes stark differences in the business applications supported by each deployment option. For example, analytical appliances are
more likely than analytical databases to be used for customer analytics (80%), risk analytics (40%)
and cross-sell recommendations (29%), while analytical databases are
more likely to be used for management reports (82%) and executive dashboards (60%). Thus, analytical databases tend to support traditional top-down reporting, while analytical appliances support bottom-up
analytics. This contrast makes sense when you remember that many of our
analytical database users are customers of Microsoft SQL Server and Oracle
10, which are best suited to reporting, not analytics. Figure 14 also shows that
file-based systems are twice as likely as the other options to be used for Web traffic analysis
(46%) and social media analysis (15%).
ROI. Not surprisingly, given their emphasis on analytics versus reporting, analytical appliances (35%) have a higher ROI than analytical databases (26%),
with analytical services close behind at 33%. Given their newness, file-based
systems delivered a surprisingly strong 25% ROI, but that's probably because
most file-based systems are open source and don't require an up-front investment in software (see Figure 15).
TECHNICAL ATTRIBUTES
Table 3
                                      ANALYTICAL   ANALYTICAL   ANALYTICAL   FILE-BASED
                                      DATABASE     APPLIANCE    SERVICE      ANALYTICAL SYSTEM
Average number of applications        5.9          11.3         8.0          4.9
Average number of concurrent users    47.1         81.4         27.5         27.8
number of Teradata Active DW customers who responded to the survey. Teradata Active EDW is geared to supporting multiple workloads, serving as a data
warehouse, data mart and operational data store. In addition, its customers
have used the product for many years, and the longer a product is used, the
more applications it tends to support (see Table 3).
Data volumes. Analytical appliances and file-based systems are neck and
neck in terms of the amount of data they store. More than 40% of both sets of
customers use the systems to store between 10 TB and 100 TB of data, and
more than 14% of both options store over 100 TB. In contrast, 40% of analytical database customers have less than 1
TB of data (see Figure 16, page 41).
combine it with other corporate data, such as sales and orders, and derive
more value from it (see Figure 17).
[Figure 16: bar chart of data volumes by deployment option. Categories: 100 GB to 1 TB, 1 TB to 5 TB, 5 TB to 10 TB, 10 TB to 100 TB, 100 TB+. Series: Analytical database, Analytical appliance, Analytical service, File-based analytical system.]
[Figure 17: bar chart of data types by deployment option. Categories: Structured, Semi-structured, Unstructured. Series: Analytical database, Analytical appliance, Analytical service, File-based analytical system.]
ARCHITECTURE
[Figure: bar chart of architectural roles by deployment option. Categories: Staging area, Data warehouse, Development/test, Prototyping. Series: Analytical database, Analytical appliance, Analytical service, File-based analytical system.]
TECHNICAL REQUIREMENTS
critical requirement for all options except the file-based system. This makes
sense, since most Hadoop developers write custom code in Java, Perl or
some other language to construct queries rather than use packaged BI tools.
[Figure: bar chart of technical requirements by deployment option. Categories: Supports preferred ETL/BI tools, Automated distribution of data, Use of commodity servers, MPP, Supports unstructured data, Mixed workload, Open source, Supports MapReduce. Series: Analytical database, Analytical appliance, Analytical service, File-based analytical system.]
However, BI and ETL vendors are extending their products to interoperate with
Hadoop, so this will undoubtedly change, since it's often easier to use tools
than write code.
Another variation is that file-based customers are much more interested in
commodity servers, open source and MapReduce than customers of
other deployment options. This makes sense, since all three requirements are
critical elements of a Hadoop ecosystem. In contrast, analytical appliance customers are
concerned with MPP, fast-loading utilities and mixed workload functionality.
This aligns with the predominant needs
of Teradata Active DW customers, who
constituted a large portion of the analytical appliance respondents.
VENDORS
[Figure: bar chart of vendor selection factors by deployment option. Categories: Successful POC, Pricing, Customer references, Incumbent vendor, Other. Series: Analytical database, Analytical appliance, Analytical service, File-based analytical system.]
Recommendations
Implement a new BI architecture. The BI architecture of the future incorporates traditional data warehousing technologies to handle detailed
transactional data, and file-based and nonrelational systems to handle
unstructured and semi-structured data. The key is to integrate these systems
into a unified architecture that enables casual and power users to query,
report on and analyze any type of data in a relatively seamless manner. This unified information access is the hallmark of the next-generation BI architecture.
More immediately, companies are using Hadoop to preprocess unstructured
data so that it can be loaded and integrated with other corporate data for
reporting and analysis. This allows BI and ETL users to use familiar tools to
query and analyze data.
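The preprocessing pattern described above can be sketched in plain Python as a map step, a reduce step and a tabular export. This is a minimal sketch under stated assumptions, not a Hadoop job: the log format, the sample lines and all function names are hypothetical, and a real deployment would run equivalent logic across a cluster.

```python
import csv
import io
from collections import Counter
from typing import Iterable, Iterator, Tuple


def map_log_line(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (page, 1) for each well-formed log line.

    Assumes a hypothetical space-separated format: "<timestamp> <user> <page>".
    """
    parts = line.split()
    if len(parts) == 3:
        yield parts[2], 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Reduce step: sum the counts emitted for each page."""
    totals: Counter = Counter()
    for page, count in pairs:
        totals[page] += count
    return totals


def to_csv(totals: Counter) -> str:
    """Write the aggregated rows as CSV text that a bulk loader could ingest."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["page", "hits"])
    for page in sorted(totals):
        writer.writerow([page, totals[page]])
    return buffer.getvalue()


# Invented sample input, including one line that does not match the format.
raw_lines = [
    "2011-09-01T10:00 alice /home",
    "2011-09-01T10:01 bob /pricing",
    "malformed line",
    "2011-09-01T10:02 alice /home",
]
pairs = (pair for line in raw_lines for pair in map_log_line(line))
csv_text = to_csv(reduce_counts(pairs))
print(csv_text)
```

The output is an ordinary two-column table (`page`, `hits`), which is the point of the pattern: once the raw text has been reduced to rows and columns, it can be loaded into a warehouse and queried with existing BI and ETL tools.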
About Pentaho:
Pentaho is the business analytics company providing power for technologists and rapid insight
for users. Powerful analytics are made easy with Pentaho's cost-effective full suite of
capabilities for data access, integration, discovery, analysis, visualization and mining.