By Mark Albala
Introduction
The disciplines of Business Intelligence have always been about delivering collaborative insight to those chartered to derive the course of action for their respective organizations. We are now in the midst of a metamorphosis toward the fourth attempt at this goal, with the basic assumptions driving the solution set very different from what they were when the basic ideas forging this industry were derived.
While the overall disciplines of business intelligence have matured and have to a large degree
met the demands of organizational stewards, the underlying architecture of the solution and
the processes employed to interface to the solution set have been enhanced rather than being
overhauled. Companies and vendors are realizing that the limits of scalability are being
reached and are rapidly drawing to the conclusion that an overhaul of the basic underlying
architecture is required.
Just as important as the basic underlying architecture is the adoption of repeatable processes that ensure accurate data while publishing data, and modifying the underlying model used to navigate that data, far more frequently than anyone is accustomed to today. One of the driving tenets of the metamorphosis of the disciplines categorized as data management is how to migrate what has until now been a batch‐oriented means of publishing data to something much more akin to an on‐line environment.
There has been much coverage in the media about the emergence of BI 2.0, which covers a lot of ground for companies to ponder. BI 2.0, the fourth attempt at delivering the enlightened enterprise, is a natural progression in arming organizational stewards with the capability to quickly garner insight from a deluge of data. However, over the life of this industry, the definitions of "quickly" and "deluge of data" have changed, and the number of people who are organizational stewards has increased as organizations have flattened. From a historical perspective, the changes in each generation of the suite intended to deliver on the promise of an enlightened enterprise are as follows:
Some of the basic drivers for this overhaul are described below:
Initial Problem Set                        Current Problem Set
Basic Challenges
Some of the key challenges practitioners and their stakeholders often discuss are:
• The rapidly increasing deluge of data made available to, and burying, organizational
  stewards. While this data is intended to provide insight and steer a course for an
  organization, problems, opportunities and blind spots have plenty of room to hide within it.
• The increased speed of communications in a global economy, which allows others to copy
  your products and innovations at a lower cost and with increasing regularity.
• The increased speed of communications in a global economy, which provides for global
information transparency in an increasingly small amount of time, leaves very short
reaction times to identify, gain insight, determine an appropriate action and pounce on
market opportunities and misfires within your own organization.
• A lack of trust in the published information, either because of:
o Not understanding its lineage
o Data quality concerns
o Outliers which skew insight gained from data
o Too many copies of information masquerading as the single version of the truth,
in both official systems of record and desktop versions of systems of record
managed through desktop tools
o Processes employed for data quality assurance which were designed for much
smaller amounts of data published much less frequently
• Tools made available for analyzing data, whose core value proposition was quickly
  identifying insight, are challenged by the sheer volume of data published and made
  available for analysis (the "three clicks to find anything" approach does not work when
  hierarchies can have a billion or more members).
• Lack of a repeatable process which garners trustworthiness of data. Most of the
consumers of analytical data describe the process used today as a black box into which
they have no insight. They believe the data is right, but the fact remains that they
validate data before using it. In a world where communication is greatly accelerated
and information transparency is reached quite rapidly thanks to the successful adoption
of the internet, the time taken to validate data is an opportunity lost. The only way to
regain the ability to benefit from these market‐based opportunities is to deliver
information that those accountable for stewarding organizations believe to be trustworthy.
The purpose of this writing is to discuss core disconnects and strategies to remediate them. Many are calling the existence of these disconnects BI 2.0, with various solutions purporting to offer the unique answer. While the tools have matured and will continue to mature into the BI 2.0 framework discussed throughout the industry, the processes employed by companies are inadequate to springboard them into the impending metamorphosis of Business Intelligence and Data Warehousing.
Information volume and complexity
The sheer volume of data has increased and will continue to increase. According to IDC and other sources, while information growth was viewed as problematic less than 10 years ago, when volumes were doubling every two and a half years, the growth rates now anticipated challenge our processes in every way (a six‐fold increase of information from 2007 to 2010). Energy costs for storage, physical hardware costs, and other such issues are important, but making such volumes of data digestible for rapid insight by those tasked to identify and pounce on opportunities, in a marketplace that quickly equalizes to a state of information transparency, is a far bigger issue.
Complexity is an impediment to usability, growth and insight. Because our processes have not matured in lock step with the rapid onslaught of data, organizational stewards look at the processes we utilize to publish information as black boxes, with no insight into information lineage, accuracy or trustworthiness.
One of the ways to simplify is to depart from the past and recreate new processes married to
the tools and architecture of the embryonic BI 2.0 offerings from existing and upcoming
vendors.
Arming organizational stewards with insightful metrics, prioritized alerts and rule‐based selection of scenarios triggered by those alerts, all implemented to accelerate action on short‐lived opportunities, is among the methods to be deployed in the next generation of BI.
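As an illustration of the kind of rule‐based alerting described above, the sketch below (in Python) evaluates a set of published metrics against simple threshold rules and returns prioritized alerts. The metric names, thresholds and priorities are hypothetical assumptions used only for illustration, not drawn from any particular BI product.

    # Minimal sketch of rule-based, prioritized alerting (illustrative only).
    # Metric names, thresholds and priorities below are hypothetical assumptions.
    from dataclasses import dataclass

    @dataclass
    class AlertRule:
        metric: str        # name of the published metric to watch
        threshold: float   # value beyond which the rule fires
        priority: int      # lower number = more urgent for the steward
        message: str

    RULES = [
        AlertRule("daily_revenue_change_pct", -5.0, 1, "Revenue dropped more than 5%"),
        AlertRule("late_shipments_pct", 3.0, 2, "Late shipments exceed 3%"),
    ]

    def evaluate(metrics: dict[str, float]) -> list[tuple[int, str]]:
        """Return prioritized alerts for the metrics published in this load."""
        alerts = []
        for rule in RULES:
            value = metrics.get(rule.metric)
            if value is None:
                continue
            # Negative thresholds fire on drops, positive thresholds on spikes.
            fired = value <= rule.threshold if rule.threshold < 0 else value >= rule.threshold
            if fired:
                alerts.append((rule.priority, f"{rule.message} (observed {value})"))
        return sorted(alerts)

    print(evaluate({"daily_revenue_change_pct": -7.2, "late_shipments_pct": 1.1}))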
A key departure from the past is to assure the integrity of information prior to its integration as
part of a repeatable process rather than as a post integration process. Such a repeatable
process requires the ability to garner insight from departures from baseline profiling runs and
outliers, and integrate remediation strategies during the data integration process.
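A minimal sketch of what such a pre‐integration check could look like is below, assuming pandas is available; the drift tolerance, outlier cutoff and column handling are illustrative assumptions rather than a prescribed method.

    # Sketch of a pre-integration integrity check: compare an incoming batch to a
    # stored baseline profile and flag rows whose values are statistical outliers.
    # The tolerances and the outlier cutoff are illustrative assumptions.
    import pandas as pd

    def profile(df: pd.DataFrame) -> dict:
        """Capture a simple baseline profile of the numeric columns."""
        return {
            col: {"mean": df[col].mean(), "std": df[col].std(), "nulls": df[col].isna().mean()}
            for col in df.select_dtypes("number").columns
        }

    def compare_to_baseline(incoming: pd.DataFrame, baseline: dict,
                            drift_tol: float = 0.10, z_cutoff: float = 4.0):
        """Return profile-drift warnings and the outlier rows to route for remediation."""
        warnings = []
        for col, base in baseline.items():
            mean = incoming[col].mean()
            if base["mean"] and abs(mean - base["mean"]) / abs(base["mean"]) > drift_tol:
                warnings.append(f"{col}: mean drifted from {base['mean']:.2f} to {mean:.2f}")
        # Flag rows far from the baseline mean so they can be remediated before load.
        outlier_mask = pd.Series(False, index=incoming.index)
        for col, base in baseline.items():
            if base["std"]:
                outlier_mask |= (incoming[col] - base["mean"]).abs() > z_cutoff * base["std"]
        return warnings, incoming[outlier_mask]

The key design point is that the comparison runs as part of every integration cycle, so departures from the baseline are surfaced before the data is published rather than after.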
The shift
There is a shift going on in the marketplace which is driving much of the change to a new
paradigm for the delivery of business intelligence. Some of the characteristics of this shift are
below:
Current Delivery Characteristic            Next Generation Characteristic
Batch Oriented Processes                   Minimization of Process Latencies
Hierarchies
The processes employed to date have been scaled to their limits, and require an overhaul to
meet the demands of a rapidly increasing data population being made available for analysis.
Information relevance
The processes employed today create an enterprise information model which is comprehensive, structurally sound and difficult to enhance. The timelines to add information to the information model are out of line with the organizational requirement of redirecting strategies and tactics in the global economy. In order to be of value to the organization, it is imperative that the enterprise information model serving as the foundation for the data warehouses and marts published for analysis be more malleable. Too often it is heard that adding data to the warehouse takes too long, is too expensive and requires too much time from organizational stewards. A far more natural way of ensuring information relevancy in the base of data made available for analysis is required.
According to Ted Friedman of Gartner, over 70% of the time required for publishing new data
to a warehouse is expended in percolating business rules. Starting the process of discerning
business rules through an automated discovery process is mandatory to ensure the malleability
of the analytic fabric used to steer organizations.
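One way the automated discovery could be seeded is sketched below; the rule types inferred (not‐null, numeric range, small value domain) are assumptions about what such a tool might propose for analysts to confirm or reject, not a description of any specific product.

    # Sketch of automated business-rule discovery: derive candidate rules from a
    # sample of the data for analysts to review. Rule types are illustrative assumptions.
    import pandas as pd

    def discover_candidate_rules(df: pd.DataFrame, max_domain: int = 10) -> list[str]:
        rules = []
        for col in df.columns:
            series = df[col]
            if series.notna().all():
                rules.append(f"{col} must not be null")
            if pd.api.types.is_numeric_dtype(series):
                rules.append(f"{col} expected between {series.min()} and {series.max()}")
            elif series.nunique() <= max_domain:
                allowed = sorted(series.dropna().unique().tolist())
                rules.append(f"{col} expected in {allowed}")
        return rules

    sample = pd.DataFrame({"region": ["EMEA", "APAC", "EMEA"], "amount": [120.0, 75.5, 310.2]})
    print(discover_candidate_rules(sample))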
Publication Speed
If information is validated prior to its publication (many organizations have given up on validating data through a rigorous data quality assurance process because of the cost of providing this quality assurance and the bottleneck it imposes on the publication process), it is an expensive and time‐consuming process. In many organizations, the time required to assure the quality of the information used to steer the organization is simply too high a cost.
The tarnishing of the data quality image has a long life when misfires in data quality are found
by organizational stewards. These can be caused by miscoded or missed business rules
surfaced during the regular publication process (reorganizations are a common source of these
misalignments) or outliers in data which skew the summarization process if not identified.
In order to be able to ensure the continued relevance and use of data published for analysis, it
is imperative that high quality relevant information be published just in time for gaining insight
and steering courses to opportunities that appear in the marketplace prior to an equilibrium of
information transparency being reached in the global economy.
This ability requires that the processes used for publishing information depart from current practices, which are devised for discrete batches of highly defined, repeatable production loads of data following the rigors, controls and timeframes associated with production systems like payroll and general ledger. Our organizational stewards cannot afford the time required to validate the integrity of data and the proper codification and lineage of the business rules used for integrating data. The data used for analysis must be much more malleable and flexible while still meeting the stringent quality controls demanded by our organizational stewards, which requires a significant departure from the historical processes used to publish data for insight.
The speed of communications and the sheer volumes of data will require our departure from the old batch ways adopted to synthesize, integrate, validate and publish information. Petabytes of information (a petabyte is one million billion bytes) will become quite common, and while we are not ready to adopt the concept of the end of theory ("The End of Theory," Wired, June 2008), severe disruption in the practices used to publish information should be expected, along with significant budgetary strain if workable return‐on‐investment models are not formulated.
Information Trustworthiness
In many organizations, the best case is that the information used for analysis is thought to be right, but the image of a black box used to publish information troubles those using this information for the key decisions that chart the course of the organization. A much better job of proving that high quality information is being published for analysis is required. This involves publishing proof of the validity of the data integration processes at all times, publishing quality metrics for review by organizational stewards so they do not chase ghosts, and adopting processes that find data anomalies before stakeholders do, since stakeholders are far more intimate with the data being published.
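A sketch of what publishing such proof alongside each load might look like is below; the metric set and the JSON layout are illustrative assumptions, not a standard.

    # Sketch of publishing per-load quality metrics alongside the data itself, so
    # stewards see evidence of validity rather than a black box. The metric set
    # and output format are illustrative assumptions.
    import json
    from datetime import datetime, timezone

    import pandas as pd

    def quality_metrics(df: pd.DataFrame, load_id: str) -> str:
        metrics = {
            "load_id": load_id,
            "published_at": datetime.now(timezone.utc).isoformat(),
            "row_count": int(len(df)),
            "duplicate_rows": int(df.duplicated().sum()),
            "null_rate_by_column": {c: float(df[c].isna().mean()) for c in df.columns},
        }
        return json.dumps(metrics, indent=2)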
New tools and processes are required to rekindle the image of information trustworthiness,
which include:
• Conforming the information sources masquerading as the single version of the truth, such
  as desktop controlled data sources, portals, data warehouses and external data which is
  used to identify and pounce on opportunities in the marketplace.
• A repeatable information quality regimen which is run as part of the regular integration
  processes, thereby identifying changes to the baseline data profiling and outliers
  potentially skewing the aggregates used as entry points to analysis.
Tools
A new family of tools is presented to the marketplace every seven to ten years, representing a giant leap in capabilities over those previously existing. Such was the case when
Pilot, Comshare, TM20, Foresight, ADRS and others were displaced by the current data
warehousing, business intelligence and data mining capabilities deployed in many
organizations. For the past several years we have focused on extending the capabilities of the
warehouse by enhancing its performance so that the entire organization could share a single
version of the truth, and something happened. The volume, complexity and frequency of data published to the data warehouse and exposed through business intelligence tools grew beyond the wildest expectations of the visionaries who crafted the underlying architecture of the overall solution, thereby requiring enhancements to the overall model used to assist organizational stewards in gaining insight from an ever increasing deluge of data with far greater efficiency and speed.
We are in the midst of this metamorphosis of the marketplace, with many of the major vendors
of BI suites losing independence and a host of small companies introducing revolutionary
capabilities to the marketplace. Having an extensible easily enhanced architecture and
infrastructure used for the deployment of tools used for the disciplines of business intelligence
is mandatory to fully take advantage of the next generation of tools.
Repeatable Process
The ability to quickly identify what doesn’t look right and spotlight it as something worthy of
the attention of organizational stewards is required to garner trustworthiness of data. Very few
organizations have made significant progress in adopting the repeatable process because they
have not adopted the core components of the tools and core processes necessary to have a
workable repeatable process. These components are:
• The ability to use profiling technologies to compare incoming data to a baseline to
understand what has changed from what was expected (the baseline). There is no time
to eyeball incoming data and top line validation of data is insufficient to garner the
necessary trustworthiness from stakeholders.
• The integration of a process that ensures the prioritization and dispensing of suspect data is
  mandatory to garner trustworthiness. Utilizing a form of workflow will keep suspect data
  issues from falling through the cracks; a sketch of such a workflow follows this list.
• Understanding what data anomalies are skewing the numeric data that is synthesized
for analysis for organizational stewards is mandatory. We are well past the point of
being able to review everything quickly through a slice and dice model which eventually
exposes issues in atomic data. Organizational stewards no longer have the patience to
go through the data to identify interesting tidbits because the slice and dice model takes
too long to surface these tidbits. What is necessary is a process to surface prioritized
items for review by organizational stewards, which requires a repeatable process to
prioritize based on generally accepted prioritization rules.
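The sketch below illustrates the kind of lightweight workflow referenced in the list above: each suspect‐data issue receives a priority from simple rules and enters a review queue so that nothing falls through the cracks. The issue kinds, priority cut‐offs and queue implementation are illustrative assumptions, not a specific workflow product.

    # Sketch of a workflow for dispensing suspect data: issues are prioritized by
    # simple rules and drained from a queue in priority order for steward review.
    import heapq

    def priority_for(issue: dict) -> int:
        # Lower number = reviewed first. These cut-offs are hypothetical.
        if issue["kind"] == "business_rule_violation":
            return 1
        if issue["kind"] == "outlier" and abs(issue.get("z_score", 0)) > 6:
            return 2
        return 3

    class ReviewQueue:
        def __init__(self):
            self._heap = []
            self._counter = 0  # tie-breaker keeps insertion order for equal priority

        def add(self, issue: dict):
            heapq.heappush(self._heap, (priority_for(issue), self._counter, issue))
            self._counter += 1

        def next_for_review(self):
            return heapq.heappop(self._heap)[2] if self._heap else None

    queue = ReviewQueue()
    queue.add({"kind": "outlier", "z_score": 8.4, "row_id": 10231})
    queue.add({"kind": "business_rule_violation", "rule": "region must not be null", "row_id": 77})
    print(queue.next_for_review())  # the rule violation is reviewed first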
Summary
To summarize, the core strategies to be employed are:
• Re‐invent the processes used to identify relevant information published to the data
warehouse and publish high quality, relevant just in time information for the
accelerated decision processes practiced by organizational stewards.
• Adopt a repeatable data profiling and governance process that utilizes comparisons to
baseline profiling and identification of outliers.
• Increase the amount of automation used in the business rule discovery processes.
• Depart from discrete, batch‐oriented information integration processes that adopt the
  rigors of production control associated with payroll and similar systems.
• Arm organizational stewards with the proof of properly loaded data, a clear
understanding of data lineage, and a clear understanding of outliers which could skew
entry points into analyses.
• Adoption of repeatable processes for the synthesis, integration, validation, prioritization
and publication of information is mandatory. Profiling technologies are a key
component to ensure that we can identify issues we don’t already know about.
• Adopt an extensible easily enhanced architecture and infrastructure to quickly adopt the
tools and techniques maturing from existing and new entrants to the disciplines of
business intelligence.