
The Big Big

Data Workbook
A practical guide to managing big data
and delivering transformative insights


Going Big

Part One: Preparing for Success
What You Need to Know
–– Why Most Organizations Implement Big Data Projects
–– Why Big Data Projects Fail
–– How to Make Your Big Data Project Work
Choosing the Right Project
–– What the Right Project Looks Like
–– Consider the Impact
–– Tactical Big Data Projects: Some Examples

Part Two: Set Your Strategy
Defining Your Goals
–– The Business Goals
–– IT’s Goals
Define Your Data Needs
–– What Data Do You Need?
–– Five Key Data Considerations

Part Three: Your Lean Big Data Supply Chain
Your Team
–– Five Key Team-Building Lessons
–– Setting Up Data Governance
–– The Skills You Need and the Skills You Have
–– Understanding Big Data Tools
Your Processes
–– The Big Data Eight
Your Architecture
–– First Steps: Your Sandbox
–– The Ideal Big Data Architecture
Your Project Plan
–– Getting Going

Further Reading
About Informatica®


Introduction

Going Big

Few technology trends have earned the kind of hype big data has.
But then again, few technology trends have offered enterprises so much
potential for transformation. Ever since software began to envelop whole
business processes at the turn of the century, it’s been clear: data has
the potential to transform enterprises.

Whether you’re starting a localized,

tactical initiative or planning a more
foundational endeavor for the whole
enterprise, this book will serve as a
practical guide on your journey.

Let’s dive in.


Part One

Preparing for Success

We’ve divided this book into three
parts. In this first part, we’re trying
to sharpen your vision so you can
pick the right project.
Part One / Preparing for Success 05

What You
Need to Know
Before digging into the specifics of your
own project, here are some lessons most
big data practitioners wish they’d known
before they started their projects.

Why Most Organizations Implement Big Data Projects

When companies decide they’re going to tackle big data, it’s usually
for one of two reasons.

–– They want to deliver existing analytics more efficiently by offloading
   infrequently used data and processes from expensive systems to new
   data platforms like Hadoop.

–– They want more holistic analytics around a business domain, such as
   customers, products, patients, or risk.

In either case, it’s because they want to discover new insights to fuel
data-driven digital transformation.

That’s a great reason to get serious about big data. But getting specific
and identifying your own business objectives (and success metrics) will be
critical to your success.

Why Big Data Projects Fail

A survey [1] found that 55 percent of all big data projects don’t get
completed, and many others fall short of their objectives.

While this failure rate isn’t atypical at such an early stage of a
technology trend, it would be foolish not to learn the lessons these
projects can teach.

Let’s look at the four main reasons for big data project failure.

1. Ambiguous goals and success metrics

The reason for failure most commonly cited in the survey was “inaccurate
scope” of project. Too many companies aim for ambitious but far too
ambiguous projects with no clear objectives, and then fail when they have
to make hard choices about what’s important and what’s not.

2. Project overruns, delays, and long waterfall development cycles

Organizations that rely on antiquated waterfall development methodologies
struggle to deliver innovative projects. These inflexible methodologies
often result in mismanaged or stale expectations. They also slow your
teams down with long delays before you deliver value.

3. Mismanaged expectations and insufficient collaboration

As tempting as it might be to make bold promises and try to deliver a
heroic effort from IT, it is important that you set proper expectations
with cross-functional stakeholders. Proactively engage with them in the
full lifecycle of projects.

4. An inability to scale

Too often, companies aim for short-term expediency over long-term
sustainability. While we would be remiss to suggest you can always avoid
making that trade-off, we can’t stress strongly enough the importance of
the long view. For your data to be appropriately secured and managed, you
need to keep an eye on the long-term implications of your project.

The four reasons for big data failure are worrying and far too common.
So let’s now take a look at how you can avoid them and build a lasting
implementation.

[1] InformationWeek, “Vague Goals Seed Big Data Failures.”

How to Make Your Big Data Project Work

If most big data projects fail for lack of clarity and inability to
demonstrate the practicality of the initiative, then you should take it
upon yourself to bring focus and proof to your project.

Here are three useful tips to make sure your project hits the ground
running and stays on its feet.

1. Set clear objectives and manage expectations

If you’re unsure what your project should aim for, consider the
objectives you’ve set for your existing data infrastructure.

If your organization already needs data for certain business processes
(like fraud detection or market analysis), consider how big data might
make those processes better or more valuable. Rather than tackling a
whole new problem, you should aim to improve an existing process or
project.

Without a clear focus and demonstrable value to business users, your
project is doomed.

2. Define the metrics that prove your project’s value

Clearly defined metrics that tie in to your objectives can save you a
great deal of trouble. If you set yourself realistic targets that can be
measured, everyone around you will be able to see the progress you’re
making.

More important, they’ll know what you’re aiming for in the long term.
Ask yourself how you can measure the impact of your project in the
context of your goals.

This is crucial because there will be short-term compromises that your
business users will need help rationalizing, and measurable goals help
you prove you’re delivering greater value than they realize.

3. Be strategic about tools and hand coding

Avoid the temptation to hand-code everything directly in Hadoop.
Remember, the goal here is not to build a working implementation with
your bare hands from scratch. The goal is to deliver the value of big
data to your organization.

Instead of attempting to hand-code every integration, clean every
dataset, and hand-code all the analytics, you should look to tools and
automation to help you accelerate through these processes.

More important, don’t fall into the trap of wasting rare, expensive Java
development talent on aspects that can’t be scaled or transferred to
other employees. Your role is to make strategic decisions about the
deployment of scarce resources in a way that achieves your goals.

Adopt tools that can increase the productivity of your development team
by leveraging the skills and knowledge of your existing ETL, data
quality, and business intelligence experts, while freeing your Java
superstars to work on specific logic for which tools aren’t available.

Additionally, because technologies like Hadoop are evolving every day,
it’s worth considering an abstraction layer that can protect you from
the constantly changing specifications of the underlying technologies.

Above all, remember the skills you need are scarce, but tools are always
available.
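
One way to picture the abstraction layer mentioned above is a small
interface that your pipeline code depends on, with the storage engine
hidden behind it. The sketch below is an illustration only: the names
`DatasetStore`, `InMemoryStore`, and `count_records` are hypothetical, and
the in-memory backend simply stands in for whichever platform (Hadoop or
otherwise) you actually use.

```python
from typing import Dict, Iterable, List, Protocol


class DatasetStore(Protocol):
    """Hypothetical interface: the only storage API pipeline code may call."""

    def write(self, name: str, rows: Iterable[dict]) -> None: ...

    def read(self, name: str) -> List[dict]: ...


class InMemoryStore:
    """Stand-in backend for illustration; a real one would wrap your platform."""

    def __init__(self) -> None:
        self._tables: Dict[str, List[dict]] = {}

    def write(self, name: str, rows: Iterable[dict]) -> None:
        self._tables[name] = list(rows)

    def read(self, name: str) -> List[dict]:
        return list(self._tables.get(name, []))


def count_records(store: DatasetStore, name: str) -> int:
    """Business logic written against the interface, not a specific engine."""
    return len(store.read(name))


store = InMemoryStore()
store.write("customers", [{"id": 1}, {"id": 2}])
record_count = count_records(store, "customers")
```

If the underlying technology changes, only a new backend class needs to
be written; functions such as `count_records` stay untouched.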

Choosing the
Right Project
In light of the challenges you’ll be facing,
let’s now take a look at how you should
go about picking the right project for
your organization.

What the Right Project Looks Like

If your organization is hungry for change and has already accepted that
it will take a comprehensive data governance framework to improve the
way it works, you can probably afford to skip this section.

On the other hand, if you’re considering a localized, tactical project
that can later be adapted for the wider enterprise, keep reading.

The right project has these four components.

1. Demonstrable value

The right project is one where the value is shared evenly between IT and
the business unit you’re trying to help. That means delivering clear
value to a department, business unit, or group in a way that they can
see.

2. Sponsorship

The executives who buy in to your vision are essential to the success of
your project. Big data projects need champions and sponsors in high
places who are willing to defend the work you’re doing.

So if you know you can build superb analytics for logistics, but the
only executive buy-in comes from the CMO, you should think again. If
marketing is your champion, build to serve marketing’s analytics
requirements. You can’t force change on anyone. Follow the influence and
make the most value you can from it.

3. A bowling pin effect

The strategic importance of your first tactical project is vital.

Not only do you want to prove beyond a shadow of a doubt that big data
can help the business unit you’re serving, you also want to make sure
that its value can then be easily communicated to the wider enterprise.

So when picking your first project, choose

Once you’ve demonstrated the value of big data to your marketing
department, for instance, it will be easier to get buy-in from the
logistics teams who might otherwise have been reluctant.

4. Transferable skills

As we said in the last point, you want the value of your first project
to help convince other departments in the enterprise. To that end, you
need to ensure you can learn the right skills, capabilities, and lessons
from your first project. More pointedly, you need to ensure you document
all of them so that you can transfer them to your next project.
Remember, if you’re aiming for success, you’re aiming for future
projects.

So be prepared to scale so you can handle more projects in the future.
This isn’t just a question of scaling your cluster. It’s about scaling
your skills and operations. You’ll either need to find more Java/Hadoop
superstars or find ways to get more out of the resources you already
have.

Consider the Impact

When you’re choosing what your next project will be, you also have to
consider how it’s going to impact your organization. There are three
broad aspects that should play a role in deciding whether you’re
pursuing the right big data project.

1. Cost and disruption

At the most basic level, the cost of your project is based on the time
and money it will take to get it on its feet. In reality, you should
also consider the potential disruption it will cause.

Sometimes the disruption is procedural: business units are used to
owning their data and aren’t comfortable giving up control of it to a
centralized data governance framework.

Other times it’s technological and skills-related: you have to integrate
new technologies into your existing infrastructure and reorganize or
upgrade skills to do so.

Whatever the case, you should anticipate and acknowledge the disruption,
and then either minimize it or communicate why it’s valuable.

2. Timing of benefits and impact

When considering different starting projects, you’ll naturally lean
toward those that can deliver maximum business impact and improvement.
But it’s also important to consider the nature of the business impact.
Will it deliver the bulk of the value in the short term or the long
term?

More important, when will business users feel the business impact? For
instance, you could introduce master data management to your data
warehouse and drastically improve the efficiency of your business
intelligence. But that value would only be felt once your business
analysts realize they won’t be cleaning financial data again.

3. Resources and restrictions

In light of your analysis of the previous two factors, consider the
resources at your disposal. We’ll be diving into this in greater detail
later, but for now just bear in mind that you naturally want your
project to deliver more bang than your invested buck.

Achieving that goal works both ways. On the one hand, you want to aim
for maximum business impact. But you also have to be strategic about how
you spend your budget. While it might be tempting to build a team of
data scientists to match Google’s, can you really afford to? Making
smart choices between tools and headcount will be critical to the
success of your project.

Tactical Big Data Projects: Some Examples

There is a wide variety of applications for big data. As exciting as
this makes it, it also makes it a little daunting for people who aren’t
quite sure which project to start with. Here’s a list of tactical big
data projects we’ve seen our customers undertake.

If you’re still not sure which project your organization should be
starting with, consider the following examples for a better idea of what
big data might offer your company.

Finance
–– Risk and portfolio analysis
–– Investment recommendations

Manufacturing
–– Connected vehicle programs
–– Predictive maintenance

Retail
–– Proactive customer engagement
–– Location-based services

Public Sector
–– Health insurance
–– Exchanges
–– Tax optimization
–– Fraud detection

Healthcare
–– Patient outcome predictions
–– Total cost of care
–– Drug discovery

Part Two

Set Your Strategy

Now we’ll get practical and look
at the specific requirements for
your next big data project.
Part Two / Set Your Strategy 17

Your Goals
Get your pencil out. As we’ve already
covered, the number one cause of big
data projects failing is a lack of clear
goals. Now let’s make sure the project
you have in mind isn’t going to drown
in ambiguity.

The Business Goals

We’ll start with the business because these goals have to take
precedence over IT’s if your project is to be fully appreciated.

Be as specific as you can be in laying down the goals you want your
project to achieve for the business. And remember to aim for goals where
the impact is measurable.

For instance, take a customer service interface that predicts customer
churn. The goals for that project shouldn’t be listed as something vague
like “improve customer experience.”

The clearer your goals, the closer you’ll get to achieving them. Five
laser-focused goals are more valuable than one vague one.

List, in order of importance, the goal(s) of your big data project as
they pertain to the business and business users. (Feel free to enter
fewer or more goals.)

e.g., Reduce customer churn

Now, for each goal, write down a measure of success that can be used to
determine whether the goal was achieved. Ideally, these should be
available metrics or calculations thereof.

e.g., Reduce average monthly churn rate by X percent

Write down a minimum and maximum amount of time for each goal to be
achieved.

e.g., Three to six months

How long should your big data project take?

Your big data project should take as long as is needed to deliver its
full value. In our experience, the scope of the project dictates the
time horizon. We’ve worked with customers who’ve delivered tactical
projects in less than three months. And we’ve worked with customers
who’ve spent three years delivering foundational programs.

For lengthier projects, bear in mind that you should be aiming to
demonstrate the value of your project every six months. If you adopt an
agile approach to the project, then it helps to present the different
phases and milestones as smaller projects.

One thing is clear: you shouldn’t be guessing how long it’s going to
take. Estimate the time to deliver based on your experience and the
experience of others who have undertaken similar projects before. If
you’re unsure who to call for guidance, you can always get in touch
with us.

IT’s Goals

Now let’s look at IT’s goals as they pertain to your project.

(Remember, if your project is about helping IT work better or faster,
you’re going to have a hard time selling that to business users. As
such, IT’s goals should be communicated in conjunction with the goals
your business users are already excited about.)

List, in order of importance, the goal(s) of your big data project as
they pertain to IT. (Feel free to enter fewer or more goals.)

e.g., Establish processes for real-time collection, cleansing,
mastering, and storage of aggregate customer data, credit card usage
data, social graph data, and churn indicators

Stop, collaborate, and listen

We’ve written this workbook so you can start your big data project,
whether you work on the business side or in IT. In either case, don’t
leave your goals to guesswork. If you need to get specific guidance on
what to aim for, grab a partner with the expertise you need and start
collaborating now.

If your project is going to succeed, you can’t do without strategic
collaboration.

Write down a minimum and maximum amount of time for each goal to be
achieved.

e.g., Two to four months

Now, for each goal, write down a measure of success that can be used to
determine whether the goal was achieved. Ideally, these should be
available metrics or calculations thereof.

e.g., Accurate churn prediction rate of X percent
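
The two example measures of success used in this section, a reduction in
average monthly churn rate and a churn prediction accuracy rate, are
simple calculations once the underlying counts are available. The sketch
below shows one way to compute them. The function names and all figures
are made up for illustration, and your organization may define churn
differently.

```python
def monthly_churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Share of customers at the start of a month who left during it."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def average(values: list) -> float:
    return sum(values) / len(values)


def percent_reduction(baseline: float, current: float) -> float:
    """Relative improvement of current over baseline, in percent."""
    return (baseline - current) / baseline * 100


def prediction_accuracy(predicted: list, actual: list) -> float:
    """Percentage of customers whose churn outcome was predicted correctly."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual) * 100


# Illustrative monthly figures: (customers at start, customers lost).
before = [monthly_churn_rate(1000, 50), monthly_churn_rate(1000, 40)]
after = [monthly_churn_rate(1000, 36), monthly_churn_rate(1000, 36)]

reduction = percent_reduction(average(before), average(after))
accuracy = prediction_accuracy(
    [True, False, True, False],   # predicted to churn?
    [True, False, False, False],  # actually churned?
)
```

Agreeing on definitions like these before the project starts is what
turns “X percent” into a number everyone can check.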

Define Your
Data Needs
We’ve now outlined the specific
goals of your initiative. Whatever your
project, you’re going to have to think
strategically about what information
you need, what datasets speak to that
need, how you’re going to get that data,
and how you’re going to use it.

What Data Do You Need?

First, let’s look at the most basic purpose of your big data project:
the information you’re trying to provide to your organization. Answer
the following questions as specifically as you can.

In order to achieve the business goals outlined previously, what do my
business users say they need to know to make an informed decision?

e.g., Which most valued customers are likely to churn and which
behaviors correlate with churn

In order to deliver that knowledge, what data can be used?

e.g., Customer purchase history, reviews data, rate of purchases,
abandonment rate, bounce rate, customer quality of service

Which source systems contain these datasets?

e.g., Customer service records, product performance metrics, customer
activity database, customer master data management

Outside of the data already noted, is there other information that might
lend context or additional value to your analyses?

e.g., Customer service survey data, competitor analyses, weather data,
social data

What datasets that I currently don’t have access to might contain
additional contextual data?

e.g., Third-party social data, third-party market data, weather data

The hunt for dark data

When considering datasets you don’t have access to, don’t confine
yourself to data outside your organization. Gartner has found that most
enterprises only use 15 percent of the data inside the organization.
Appfluent, a company that does statistical analysis on data warehouse
utilization, found that somewhere between 30 percent and 70 percent of
data in a data warehouse is dormant. The rest is hidden away in
hard-to-reach, expensive-to-use, or difficult-to-find silos, legacy
archives, and data stores. That wouldn’t be a problem if it weren’t for
the fact that you’re already paying to store all this data.

When looking for the data you need, it’s worth starting with a look at
the data your organization already has.

Five Key Data Considerations

Once you’ve outlined the data you’re going to be looking for, you’ll
have a clearer view of your specific big data challenges.

In particular, there are five key elements you should consider before
you go any further, as they will dictate what needs to be done for each
dataset, as well as for your big data dataset as a whole.

1. Prepare for volume

You’re going to have to get ready to deal with the ‘bigness’ of the data
you need. Across dimensions, classify your data based on its value (say,
customer transactions), usage (frequency of access), size (gigabytes,
terabytes), complexity (machine data, relational data, video…), and
who’s allowed to access it (only your data scientists or any casual
business user).

A thorough, organized inventory of your data will help you determine how
to manage all of it. Assess your current storage and processing capacity
and look for the most cost-effective and efficient ways to make it
scalable.

2. Account for variety

The most challenging aspect of big data is the multitude of formats and
structures you’re going to have to reconcile in your analyses. You will
have to integrate a number of sources if you want to include new data
types and structures (social, sensors, video) with the sources you’re
already used to (relational, legacy mainframes).

Attempting to manually code every single integration is so cumbersome it
could cost you all the time and resources you have. Make the most of
available data integration and data quality tools to speed up the
process and free your time for more valuable tasks.
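
The inventory described under “Prepare for volume” can start as a simple
structured list. The sketch below records the five dimensions named
there (value, usage, size, complexity, access) for each dataset. The
class name, field names, sample entries, and the rule for flagging
offload candidates are all assumptions for illustration, not a
prescribed method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetEntry:
    name: str
    value: str                # business value, e.g. "customer transactions"
    accesses_per_month: int   # usage: how often the data is read
    size_gb: float            # size
    complexity: str           # e.g. "relational", "machine data", "video"
    allowed_roles: List[str]  # who is allowed to access it


def offload_candidates(inventory, max_accesses=5, min_size_gb=500.0):
    """Names of large, rarely used datasets (thresholds are arbitrary examples)."""
    return [
        entry.name
        for entry in inventory
        if entry.accesses_per_month <= max_accesses and entry.size_gb >= min_size_gb
    ]


# Hypothetical inventory entries.
inventory = [
    DatasetEntry("orders", "customer transactions", 900, 120.0,
                 "relational", ["analyst", "data scientist"]),
    DatasetEntry("clickstream_2012", "web behavior", 2, 4000.0,
                 "machine data", ["data scientist"]),
]

total_size_gb = sum(entry.size_gb for entry in inventory)
candidates = offload_candidates(inventory)
```

Even a list this simple makes it easier to see which data is big but
dormant, and therefore a candidate for cheaper storage.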

3. Handle velocity

The combination of real-time streaming data and your historical data
usually increases the predictive power of analytics. So, some of the
data you want may only be valuable if it’s constantly pouring into your
systems.

Indeed, most real-time analyses need to be based on streaming data,
often from different sources, in different formats. Prepare your project
with streaming analytic technology and a logical infrastructure to
manage all the data.

4. Ensure veracity

No matter how important your analysis, it won’t be worth anything if
people can’t reasonably expect to trust the data that’s gone into it.
The more data you analyze, the more important it is that you maintain a
high level of data quality.

For your data to be fit for purpose, you’ll need to know the purpose
it’s being used for. If a data scientist is looking for patterns in
aggregated customer data, the preparation required will be minimal. On
the other hand, financial reporting and supply chain data will need to
be highly curated, cleansed, and certified for accuracy and compliance.

Create categories based on the amount of preparation needed, ranging
from raw data to a highly curated, mastered data store of cleansed,
reliable, authoritative data.

5. Think compliance

The different datasets you deal with are going to come with different
security stipulations and requirements. For each dataset, you need to
consider what it will take to anonymize the data based on security
policies.

Masses of your data will be proliferating across the enterprise in
hundreds of data stores.

Understand where your sensitive data resides and ensure you secure it at
the source through encryption and then control who has access to it.

Even beyond secure, smart archiving of sensitive data, mask your data
with predefined rules any time it migrates or enters your development
and test environments.

Apply these five considerations to every dataset you deal with, and
you’ll be prepared for your big data challenge in a more realistic way.
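
The predefined masking rules mentioned under “Think compliance” can be
pictured as a mapping from field name to masking function, applied
whenever records are copied into a development or test environment. This
is a minimal sketch under stated assumptions: the field names, the rules,
and the `mask_record` helper are hypothetical, and hashing an email this
way is pseudonymization rather than full anonymization, so it is no
substitute for a proper data masking tool or your own security policies.

```python
import hashlib


def mask_card_number(value: str) -> str:
    """Keep only the last four digits visible."""
    digits = [c for c in value if c.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


def pseudonymize(value: str) -> str:
    """Replace a value with a stable token (unsalted hash, illustration only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


# Hypothetical rule set: field name -> masking function.
MASKING_RULES = {
    "card_number": mask_card_number,
    "email": pseudonymize,
}


def mask_record(record: dict) -> dict:
    """Return a copy of the record with every rule-covered field masked."""
    return {
        key: MASKING_RULES[key](value) if key in MASKING_RULES else value
        for key, value in record.items()
    }


production_row = {
    "customer_id": 42,
    "email": "pat@example.com",
    "card_number": "4111 1111 1111 1111",
}
test_row = mask_record(production_row)
```

Because the rules live in one place, the same masking is applied every
time data moves, rather than depending on each developer remembering it.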

Part Three

Your Lean Big Data

Supply Chain
To make sure your initiative is as
efficient and effective as possible,
you’re going to need the right balance
of people, process, and technology.
Part Three / Your Lean Big Data Supply Chain 29

Your Team
Your people are going to play the
biggest part in making sure your
initiative succeeds.

Five Key Team-Building Lessons

Most organizations underestimate the level of skill needed to
successfully apply new technology like Hadoop.

Distributed data frameworks are just plain hard to manage. From the Java
skills required to develop on Hadoop to the new data science skills for
which you’ll have to hire, you’re going to have to bring in a lot of new
skills for your project to really fly. [2]

When you start building your team, be sure to incorporate the following
lessons into your hiring strategy.

[2] “Hadoop, Python, and NoSQL lead the pack for big data jobs,”
InfoWorld, May 5, 2014

1. Use the skills you hired your people for

One of the biggest mistakes companies make when they hire data
scientists and quantitative analysts is to make them do the dirty work.
When your most skilled resources spend all their time hand-coding data
integrations and cleaning data, you don’t just leave them frustrated.
You also fail to take advantage of the skills that were so difficult to
find in the first place.

Concentrate rare skills on the tasks that really need those skills. You
don’t want your best people to leave and you certainly don’t want them
wasting time doing work you could comfortably do with tools instead.

2. Think strategically about the composition of your team

If things work out, your project will grow in both scope and resources.
Think strategically now and save yourself the harsh realization that you
can’t scale certain processes quickly enough because there is a finite
number of people with the skills you need, even in Silicon Valley.

If your project grows in scope, what skills can you reasonably expect to
find in time to match your needs? For instance, data scientists are far
harder to find, train, and hire than developers. [3]

The balance of your team is crucial. You’re looking for the right mix of
hard-earned data management experience and an enthusiasm to learn new
tools. Additionally, you need to strike a balance between people with
technical skills and people with the domain expertise needed to build
the right models.

3. Align the goals of your project early and then communicate them

One of the most common errors companies make when hiring new people is
to forget to communicate the true goals of the project. From the first
interview through to the actual job itself, it needs to be made crystal
clear what you are trying to deliver to business users. Leverage your
executive sponsorship to evangelize the mission and share success
stories as well as issues.

Without a firm grasp of the business value of your project, your new
hires run the risk of thinking they only have to think about IT’s goals
for the project.

[3] “Big Data’s High-Priests of Algorithms,” Wall Street Journal,
August 8, 2014

4. When your team grows, the need to manage it grows, too

Unlike new technology that can be deployed, implemented, and then
integrated in an objective fashion, new people have to get used to where
they’re working, what they’re doing, and why they’re doing it. Whether
it’s you or someone else, someone needs to embrace the management
challenge that a new team calls for.

Elements like culture and cohesiveness should not be underestimated.
Think long and hard about how you integrate new hires into your
processes. You may not be able to train them for skills, but you can
certainly help them be better team members.

5. Your team cannot afford to stand still

Big data technologies are emerging every day. And the ones that already
exist are evolving rapidly. It’s a hugely exciting time for the
companies brave enough to adopt best practices early. But it also
represents the defining challenge of having a head start on your
competitors.

Your people need to evolve their skills as rapidly as the world around
them is changing. The good news is nothing motivates the best people
more than the challenge of staying ahead of the curve. The challenge is
in delivering the training and discussion they need to keep growing
their abilities and yours.

The importance of being strategic

An important choice you’ll be making over and over again is whether to
build your capabilities using automated tools or manual integrations.

Hand-coding offers you complete, precise control over what you’re
building. Often this is invaluable and necessary if, for instance,
you’re writing a complex script to extract metadata in a way that isn’t
possible yet.

Tooling, on the other hand, offers you greater agility and the ability
to sustainably repeat the same process. For tasks like data integration
and data quality, this is crucial because it means you aren’t forcing
your super-intelligent analysts and scientists to do the dirty work.

Be realistic about your resources. If you can’t build a team as big and
brilliant as Google’s, don’t waste your scarce resources trying.

Setting Up Data Governance

If (and hopefully when) you’re setting up a more foundational big data
endeavor, you’re going to have to put in place the procedural framework
for data governance. In fact, even if your big data project is aimed at
delivering value to a single department, you might consider building a
miniature data governance board so you can learn how to cope with the
unique challenges such a body presents.

Essentially, your data governance board is the formal body of executives
meant to oversee the enterprise’s approach to data. But governance also
requires data stewards: functional or department-specific people who are
tasked with managing the data coming from a specific business unit.

(In fact, some of our clients assign data stewardship roles based on the
data domain. That means one person is in charge of product data, while
another is in charge of customer data, and so on.)

You should aim to create processes that ensure your data governance
framework helps more than it hurts. Actively work to keep it from
turning into a bureaucratic burden by ensuring everyone is committed to
achieving the same goals in the same time frames.

Your data governance framework should aim for the following five
characteristics.

1. Cross-functional

A data governance board comprising different people with similar roles
will be ineffective. The aim is to create a body capable of representing
the unique views and needs of each business unit your big data project
is meant to serve.

2. Communicative

Without good communication across functions, departments, and domains,
your project is likely to drown in bureaucracy and misunderstanding.
This happens far too often. Ensure all concerns are either abated or
appropriately addressed.

3. Efficient

Your cross-functional process shouldn’t feel like a barrier. It will
take great agility for your big data project to succeed. So build in
automation and exception reporting rules wherever possible and adopt
collaboration tools to keep the lines of communication open and
expedient.

4. Committed

Be sure to communicate the primary goals of your project effectively and
make sure everyone involved in your data governance framework is
dedicated to achieving those goals. Common goals should guide your
governance thinking and decision-making.

5. Centralized

The biggest challenge with a data governance framework comes when you
have to prioritize the goals of one business unit over the others being
represented on the board. Ensure your decisions are for the long-term
benefit of the whole board even if it means short-term benefits being
felt by one business unit.

The Skills You Need

and the Skills You Have

Time to get your pencil out again. Now that you can
see the various subjective pitfalls and opportunities
your new team will present, let’s figure out what this
team will actually look like.

The following page lists big data roles based on the
jobs for which we’ve seen our customers hire. Based
on the personnel currently available to you and the
amount of time you expect your project to take
(as entered in the section beginning on page 23),
list how many people you need to hire.

The role | Can anyone perform this role already? | I need to hire for this role | Based on the amount of time I have, I need to hire X people

Data scientist
Domain expert
Business analyst
Data analyst
Data engineer
Database administrator
Enterprise architect
Business solution architect
Data architect
Data steward
ETL (data integration) developer
Application developer
Dashboard developer
Statistical modeler

The need for integrated thinking

When you go out looking for new team members, don’t limit yourself to people with the right qualifications. Make no mistake: finding people with the right qualifications is a challenge in and of itself. But you also need to look for people with a willingness to synthesize business goals and technical capabilities.

Over and over, we hear from customers about how important it is that the people who join their big data projects are able to understand business realities and execute complex data science. This type of integrated thinking is huge and hard to find. It’s worth training for and worth rewarding.






Big Data Tools

The tools you use have a strategic role to play in the execution of your big data project. Go through this list and put an X against the tools that are most important, and most strategically relevant, to your specific project.

Analytics

The tools and processes that turn raw data into insight, patterns, predictions, and calculations about the domain you’re analyzing.

Visualization

Can you present your data and findings in ways that are easy to digest and understand?

Advanced analytics

Can you apply cutting-edge analytic algorithms to your datasets to conduct complex calculations?

Machine learning

Can you apply sophisticated machine learning algorithms to identify patterns and make predictions at a level that you don’t have the manual bandwidth to handle?

Among these tools and technologies, some tools like data integration, data quality, and master data management are so fundamental to your big data journey that they really aren’t worth re-building. The amount of time and resources it takes to build those capabilities with your own hands isn’t worth the precious skills and man-hours of your big data project.

Remember the goals of your project, and that they didn’t include custom-building everything.
Your Processes

Let’s dive into the actual processes you’ll
need to tackle big data. Your specific
processes will be unique to your goals
and requirements, but this section should
give you an overview of what to expect
and what you’ll be learning.

The Big Data Eight

From experience, we can say that agile methodologies are an excellent approach to big data projects. They ensure you manage expectations, learn from mistakes, and iterate your way to the best processes. That said, the approach to your project depends entirely on you and your situation.

In any event, the following eight steps will prove crucial to your big data supply chain. However you go about it, ensure you and your team establish effective processes for these steps.

Go through this list of tools and put an X up against the ones most important, and then most strategically relevant, to your specific project.

1. Accessing the data

Your first challenge will be to acquire all the data you need. In some cases, this will entail capturing streaming data, and in others it will mean extracting data from a database. Set up repeatable, manageable processes to ensure this data can then be stored according to the ways you’re going to use it.

2. Integrating the data

Big data’s most complex challenge involves the variety of data structures and formats. For your analyses to be sustainably conducted, you’ll need to set up a process for integrating and normalizing all this data. Ideally, this should take up as little manual processing as possible.

3. Cleansing the data

For your analyses to be reliable, you have to ensure you’re cleansing your data to remove duplications, errors, inaccuracies, and incomplete data. Your process should ensure your most qualified analysts and scientists aren’t spending all their time effectively “doing the dishes.”

4. Mastering the data

One way to maintain a reliable source of clean, integrated data is to establish a process to master your data. The aim is to create a rich collection of consolidated data, organized by domain (such as products, customers, etc.), and enriched with big data insights, that can then feed all your other systems.
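As an illustration of the cleansing step described above (removing duplications and incomplete data), here is a minimal Python sketch of a repeatable cleansing pass. The field names (`customer_id`, `email`) and the normalization choices are assumptions made for the example, not part of this workbook. A production pipeline would normally rely on dedicated data quality tooling.

```python
def cleanse(records, required=("customer_id", "email")):
    """Normalize records, set aside incomplete ones, and drop duplicates."""
    seen = set()
    clean, rejected = [], []
    for record in records:
        # Normalize: trim whitespace on every string value.
        normalized = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in record.items()
        }
        # Normalize: treat email addresses as case-insensitive.
        if isinstance(normalized.get("email"), str):
            normalized["email"] = normalized["email"].lower()

        # Incomplete data: any required field missing or empty.
        if any(not normalized.get(field) for field in required):
            rejected.append(record)
            continue

        # Duplicates: same values in all required fields after normalization.
        key = tuple(normalized[field] for field in required)
        if key in seen:
            continue
        seen.add(key)
        clean.append(normalized)
    return clean, rejected


raw = [
    {"customer_id": "C1", "email": " Ann@Example.com "},
    {"customer_id": "C1", "email": "ann@example.com"},  # duplicate of the first
    {"customer_id": "C2", "email": ""},                 # incomplete
]
clean, rejected = cleanse(raw)
print(clean)
print(rejected)
```

Keeping the rejected records, rather than silently discarding them, gives data stewards something to review and fix at the source.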

5. Securing the data

Here, you’ll be establishing two basic processes. The first will be about defining the security rules and practices that each dataset calls for. The second will be about detecting sensitive data and masking it in a persistent or dynamic way to ensure those rules and best practices are enforced consistently.

6. Analyzing the data

Your process for analysis will depend on your analysts, your analytics tools, and your requirements as they pertain to your goals. The mindset of iterative discovery and continuous improvement will play a crucial role here as you want this process to get better, faster, cheaper, and more scalable with time and experience.

7. Analyzing your business needs

This step is both critical and almost always overlooked. Set up a clear process for the analysis of business needs even while you’re analyzing your data. This is crucial because if you take your finger off the pulse of the business, you risk isolating your efforts and ensuring the business impact is minimized.

8. Operationalizing the insight

As we discussed early on in this workbook, the business impact of your big data project needs to be felt. Create automated pipelines for the answers you find and deliver them to the business users who need them most. For instance, data about customers most likely to churn should be made available to your customer service agents via a dashboard. Be sure to incorporate a feedback loop as well so you can see how the insight is received.

The importance of documentation

Aim to master these eight steps and your big data project will be heading in the right direction. The aim is to establish clear, repeatable, scalable, continuously improving processes. To that end, the documentation of these processes and the ensuing improvements are vital to your team.

The skills, capabilities, and lessons of your big data project must be made transferable and communicated frequently.
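Step 5 above talks about detecting sensitive data and masking it persistently. The following Python sketch shows one possible way to do that, under stated assumptions: sensitive values are limited to email addresses found with a simple regular expression, and "persistent" masking is approximated with a keyed hash so the same input always maps to the same token. The pattern, the key, and the `masked_` prefix are hypothetical. Keyed hashing is a form of pseudonymization, so it may not satisfy every security rule your datasets call for.

```python
import hashlib
import hmac
import re

# Simplified pattern for illustration; real detection needs broader rules.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical key; in practice it would come from a managed secret store.
SECRET_KEY = b"replace-with-a-managed-secret"


def mask_value(value: str) -> str:
    """Return a stable token for a sensitive value (same input, same token)."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"masked_{digest[:12]}"


def mask_record(record: dict) -> dict:
    """Replace every detected email address in string fields with its token."""
    masked = {}
    for field, value in record.items():
        if isinstance(value, str):
            masked[field] = EMAIL_PATTERN.sub(lambda m: mask_value(m.group(0)), value)
        else:
            masked[field] = value
    return masked


record = {"id": 1, "note": "contact ann@example.com"}
print(mask_record(record))
```

Because the token is stable, masked datasets can still be joined on the masked field, which is useful when production data is copied into a test environment.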
Your Architecture

For your big data supply chain to be lean
and effective, you have to ensure the
architecture is solid and strategically
built. In this section we’ll look at what
the ideal big data architecture should
look like and how to go about deploying
yours in a phased approach.

First Steps: Your Sandbox

When you start building the architecture for your big data project, the most logical starting point is to set up a sandbox development environment in which you can use test data to ensure your architecture is feasible. In doing so, make sure you consider the following lessons.

Start small

By starting with a well-defined sandbox over which you have complete control, you’ll be able to iterate your way to the most successful implementation. Get up and running as soon as possible and document the lessons learned through each iteration.

Size matters

The key difference between the sandbox and your actual implementation is the fact that your production environment will be far larger. It will require automated processing to ingest, integrate, cleanse, and distribute the output. As such, it will take a far more robust structure and proven components and processes to be truly reliable and flexible in a live production environment.

Mask before you test

When organizations use test data, they usually use a variant of their live, production data to ensure the formats and structures represent the live environment. Unfortunately, a failure to mask this data appropriately can leave sensitive data exposed in an entirely unsafe test environment.

Don’t get lost in translation

One of the most common sources of project overruns and costly delays in big data projects is that errors in hand-written code, missed in the sandbox, come back to haunt the team when the architecture goes live. So if you hand-code significant parts of your architecture, expect to refactor a lot of code to meet production-level requirements and manage expectations accordingly. Alternatively, use productivity and automation tooling to avoid those errors, and the need to refactor, in the first place.

The Ideal Big Data Architecture

The following diagram represents the way we recommend building the ideal big data technology and process architecture.

[Diagram: five groups, arranged from left to right]

Data Sources
–– Relational databases
–– Mainframe
–– Documents and emails
–– Social media, third-party data, log files
–– Machine sensor
–– Public cloud
–– Private cloud

Data Ingestion
–– Batch load
–– Change data capture
–– Data streaming
–– Archiving

Data Management
–– Data integration
–– Data quality
–– Virtual data machine
–– Data security
–– Master data management
–– Scalable storage (e.g., Hadoop)
–– Data warehouse

Data Delivery
–– Batch load
–– Data integration hub
–– Data virtualization
–– Real-time and event-based processing

Applications
–– Visualization
–– Mobile apps
–– Analytics
–– Business intelligence
–– Real-time dashboards

Project Plan
We’ve now analyzed every aspect of your
big data journey. The next step for you is
to use this project plan as a skeletal guide
to managing your big data project from
inception to implementation.

Your Project Plan

Use this project plan template as a framework to document the details and various elements of your big data project. Then use the compiled document as a way to get the buy-in you need from the rest of your organization. It will also come in handy when you approach external partners.

Stage 1: The strategy

Identify the business and IT’s goals
Define your measures of success

Stage 2: The data

Identify the information you need
Identify the data and sources to deliver it

Stage 3: The supply chain

The people
–– Assessing the skills you need
–– Assessing the skills you have

The process
–– Access data
–– Integrate data
–– Cleanse data
–– Master data
–– Secure data
–– Analyze the data
–– Analyze the business needs

The tools
–– Distributed computing (e.g., Hadoop)
–– Data quality
–– Data integration
–– Master data management
–– Data masking
–– Visualization
–– Streaming analytics
–– Analytics
–– Machine learning

Stage 4: Operationalize the insight

Develop dashboards
Automate processes for data delivery
Set up a feedback process


Getting Going

Use the checklists, principles, and guidelines we’ve
outlined in this workbook to organize your initiative
and make sure you’re primed for success. Big data
has the potential to transform your enterprise and
help you unlock all sorts of new value. So think
about how you manage your people, processes,
and technology to deliver on your vision.

And the transformation will be big.


Further Reading

How Big Data Management Turns Petabytes into Profits

Big data integration and management are critical foundations for getting new insights from big data. With the right approach, you can unite disparate and complex information from multiple sources and turn it into trusted, timely, and actionable data for advanced analytics.

Get our white paper How Big Data Management Turns Petabytes into Profits to discover the three pillars of big data management, and how to turn a big data laboratory into a big data factory.

About Informatica

Digital transformation changes expectations: better service, faster delivery, with less cost. Businesses must transform to stay relevant and data holds the answers.

As the world’s leader in Enterprise Cloud Data Management, we’re prepared to help you intelligently lead, in any sector, category or niche. Informatica provides you with the foresight to become more agile, realize new growth opportunities or create new inventions. With 100 percent focus on everything data, we offer the versatility needed to succeed.

We invite you to explore all that Informatica has to offer, and unleash the power of data to drive your next intelligent disruption.

Worldwide Headquarters
2100 Seaport Blvd, Redwood City, CA 94063, USA
Phone: 650.385.5000
Fax: 650.385.5500
Toll-free in the US: 1.800.653.3871

informatica.com
linkedin.com/company/informatica
twitter.com/Informatica


© Copyright Informatica LLC 2014, 2017. Informatica and the Informatica logo are trademarks
or registered trademarks of Informatica LLC in the United States and other countries.