
10 Hard Questions to Make Your Choice of Self-Service Data Prep Easy
PAXATA | EBOOK

INTRODUCTION

Self-service data prep may sound easy, intuitive, and exactly what you need for
that daunting data transformation and preparation project. Yet, given the vast
differences among products and solutions available today, it is easy to get lost.

This guide offers ten real business scenarios that will capture your attention, help
you cut through the noise, and focus on what truly fits your needs and requirements.
In each scenario, we provide real-life examples and use cases so that you can
recognize your own environment in the description.

1 IS REAL TIME EXPLORATION OF DATA AN IMPORTANT PART OF YOUR DATA PREPARATION?

Not all data prep solutions are created equal. Some follow a workflow style where
predetermined rules are applied to prepare data for analysis. This requires
knowing the questions to ask of the data beforehand. For example, in a supply
chain scenario, you would instruct the tool to replace any supplier location
occurrence of “Phinix” with “Phoenix”. But one can quickly see the limitations of
this approach. In this example, would you have considered “Finix”?

As opposed to workflow style data prep products, interactive data prep solutions
allow you to explore and validate your data by seeing all of its contents – in
real time.

BUSINESS SCENARIOS
- Data lakes that were populated with no attention to data quality
- Data with high potential of anomaly (e.g. weight that could be recorded in
  ounces, pounds, grams, kilograms, and more)
- Where context is important (e.g. location data can be London, England or
  London, Ontario)
- Data with a wide distribution of values (e.g. budget distribution or product
  type distribution)
- Data with gaps, peaks, and outliers (e.g. an outlier year of 2107)
- Onboarding new client data
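
The “Phinix” versus “Finix” limitation can be illustrated with a short sketch. It
is not how any particular product works; the sample values, the list of
known-good names, and the 0.5 similarity cutoff are all assumptions chosen for
illustration. It contrasts a predetermined replacement rule with a similarity
check from Python's standard difflib module.

```python
from difflib import get_close_matches

# Hypothetical supplier locations; the misspellings mirror the examples in the text.
locations = ["Phoenix", "Phinix", "Finix", "London", "Phoenix"]

# Workflow style: a predetermined rule that only knows about "Phinix".
fixed_rule = {"Phinix": "Phoenix"}
by_rule = [fixed_rule.get(value, value) for value in locations]

# Exploratory style: compare every value against a list of known-good names.
known_good = ["Phoenix", "London"]  # assumed reference list


def closest_known(value, cutoff=0.5):
    """Return the most similar known-good name, or the value itself if none is close."""
    matches = get_close_matches(value, known_good, n=1, cutoff=cutoff)
    return matches[0] if matches else value


by_similarity = [closest_known(value) for value in locations]

print(by_rule)        # "Finix" is left untouched by the fixed rule
print(by_similarity)  # "Finix" is mapped to "Phoenix"
```

A cutoff this lenient can also produce false matches, so in practice the
suggested corrections would still need to be reviewed by someone who knows the
data.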

paxata.com 3
2 DO YOU HAVE A LOT OF UNKNOWN DATASETS, SUCH AS THIRD-PARTY DATA AND FORM-FILLS?

When data is sourced from in-house systems of record, the user typically
possesses some level of knowledge about it. However, very little is known about
data that comes from external sources. For example, new client or supplier data
being onboarded varies in type and complexity.

The same situation occurs when data comes from form fields such as those in
survey software, marketing automation, logistics and scheduling apps, ERP, and
others. Data in these situations typically contains a wide variety of
misspellings.

Similarly, in application migrations or consolidation of legacy systems that were
created years ago, one lacks full knowledge of the information architecture.

In order to better understand the data context and semantics, one must think
about exploring, profiling, and analyzing the data values and content before
integrating or migrating it. Built-in, smart algorithms that can detect the
semantic and syntactic context of the data, potential joins, and different
spellings can accelerate the time to value in these scenarios.

BUSINESS SCENARIOS
- Onboarding and blending second- and third-party data with first-party data
- Curating external sources of data to create a data product
- Application migration or application integration
- Data quality and validation of free form text fields (e.g. address cleanup for
  marketing campaigns)
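
As a rough illustration of what profiling an unfamiliar free-form field can
reveal before integration, the sketch below counts blanks and distinct values
after light cleanup. The sample entries and the profile helper are hypothetical,
and real products use far richer algorithms than this.

```python
from collections import Counter

# Hypothetical free-form "state" entries collected from a form fill.
entries = ["Arizona", "arizona ", "AZ", "Arizona", "Arizonaa", "", "AZ"]


def profile(values):
    """Summarize a column: row count, blanks, and frequency of each cleaned value."""
    cleaned = [value.strip().lower() for value in values]
    blanks = sum(1 for value in cleaned if not value)
    counts = Counter(value for value in cleaned if value)
    return {
        "rows": len(values),
        "blanks": blanks,
        "distinct": len(counts),
        "counts": counts,
    }


summary = profile(entries)
# Rare values such as "arizonaa" stand out as likely misspellings to review.
for value, count in summary["counts"].most_common():
    print(value, count)
```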

3 WOULD INACCURATE DATA JEOPARDIZE YOUR REPUTATION OR REVENUE?

Some data preparation tools limit the user to a small sample of data. In this
case, one is left to hope that all of the possible anomalies and outliers are
included in the small, allocated sample of data. While this is a viable solution
in some use cases, such as prototyping or requirement gathering that will
subsequently be validated against the entire body of data, it is not a solution
that is suitable for sensitive situations.

In scenarios where inaccurate data would compromise the outcome, the data prep
solution must provide the flexibility to choose the sample size and be able to
explore and prepare the entire dataset. For example, in financial crimes
compliance reporting, failure to detect fraudulent transactions which happen to
be outliers and are therefore not included in the data sample can severely
damage the financial institution’s reputation.

BUSINESS SCENARIOS
- When the full body of data is needed to ensure accuracy (e.g., financial crimes
  compliance or clinical trials)
- When data is your product (e.g. information marketplaces, information as a
  service, or product catalogs)
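
The risk of relying on a small sample can be shown with a minimal sketch. The
transaction amounts, the 10,000 threshold, and the "first 1,000 rows" sampling
strategy are assumptions made up for this example; they do not describe any
specific tool.

```python
# Hypothetical transaction amounts: 100,000 routine payments plus one extreme
# value that sits far beyond the first rows of the file.
amounts = [100.0 + (i % 50) for i in range(100_000)]
amounts[98_765] = 2_500_000.0

THRESHOLD = 10_000.0  # assumed cutoff for a "suspicious" amount


def flag_outliers(values, threshold=THRESHOLD):
    """Return (row index, amount) pairs for every amount above the threshold."""
    return [(index, value) for index, value in enumerate(values) if value > threshold]


sample = amounts[:1_000]  # a tool limited to the first 1,000 rows
in_sample = flag_outliers(sample)
in_full = flag_outliers(amounts)

print(len(in_sample))  # 0: the sample contains no outlier at all
print(in_full)         # the single outlier only appears in the full scan
```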

4 IS GOVERNANCE A KEY PIECE OF YOUR DATA PREPARATION AND REPORTING?

In many cases, a data prep project leads to downstream reporting and analytics
that are used across executive and multi-functional teams, or externally by
government bodies. In these cases, factors such as how the data is sourced, the
transformations that it has gone through, and usage analytics on who is
consuming it are all important elements in gaining trust and complying with
regulations.

Data prep solutions that self-document every user action and each
machine-learning operation applied to the data ensure complete auditability.

The key is to provide end-to-end traceability of data. For example, in cases
where the information from the data preparation ends up in downstream systems
such as business intelligence applications or public portals, it is vital to
show a full lineage.

BUSINESS SCENARIOS
- Weekly sales and marketing reports
- Executive dashboards
- Evidence-based medicine in healthcare
- Compliance reporting, such as anti-money laundering
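
One way to picture self-documenting preparation steps is a dataset wrapper that
records who applied which transformation. This is only a sketch under stated
assumptions: the TrackedDataset class, the user names, and the source file name
are hypothetical and are not taken from any product.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TrackedDataset:
    """Rows plus a record of where they came from and what was done to them."""
    rows: List[dict]
    source: str
    lineage: List[str] = field(default_factory=list)

    def apply(self, description: str, user: str,
              step: Callable[[List[dict]], List[dict]]) -> "TrackedDataset":
        """Run a transformation and append a lineage entry describing it."""
        new_rows = step(self.rows)
        entry = f"{user}: {description} ({len(self.rows)} -> {len(new_rows)} rows)"
        return TrackedDataset(new_rows, self.source, self.lineage + [entry])


raw = TrackedDataset(
    rows=[
        {"region": "west", "sales": 10},
        {"region": "west", "sales": None},
        {"region": "east", "sales": 7},
    ],
    source="crm_export.csv",  # hypothetical source name
)

prepared = raw.apply(
    "drop rows with missing sales", "analyst_a",
    lambda rows: [row for row in rows if row["sales"] is not None],
).apply(
    "uppercase region", "analyst_b",
    lambda rows: [{**row, "region": row["region"].upper()} for row in rows],
)

print(prepared.source)
for entry in prepared.lineage:
    print(entry)
```

Because each step returns a new object, the original rows and every
intermediate lineage list remain available for audit.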

5 IS VERSIONING AND HAVING A SNAPSHOT OF YOUR DATA PREP PROJECTS AND DATASETS CRITICAL?

In some cases, data prep projects are a one-time event. More often, however, you
will accumulate new data, expand information presented in a business dashboard,
extend source systems, or need to record snapshots of your prepared data for
point-in-time analysis.

A basic example is when one set of data is prepared for bookings before it is
augmented with another dataset to conduct bookings versus revenue analysis. In
this scenario, the user may want to store and version “bookings” before blending
it with other data sets and storing it as “revenue versus booking.”

Having a complete history of a data preparation project at any given point is
not a capability offered in all data prep solutions. Only those with a database
backend equipped to store a snapshot of data in time are able to achieve this.

BUSINESS SCENARIOS
- Regulatory and audit-heavy programs
- Data quality monitoring
- Single source to multi-source progression of data blending
- Daily / weekly / quarterly reporting
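
The bookings example can be sketched with a small in-memory snapshot store. The
SnapshotStore class and the sample figures are hypothetical; a real solution
would persist snapshots in a database backend, as the text notes.

```python
import copy


class SnapshotStore:
    """Keeps numbered, immutable copies of datasets by name (in memory only)."""

    def __init__(self):
        self._versions = {}  # dataset name -> list of snapshots

    def save(self, name, rows):
        """Store a deep copy of the rows and return its 1-based version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(copy.deepcopy(rows))
        return len(versions)

    def load(self, name, version):
        """Return a copy of the rows exactly as they were when saved."""
        return copy.deepcopy(self._versions[name][version - 1])


store = SnapshotStore()

# Step 1: prepare and version "bookings" on its own.
bookings = [{"deal": "A", "booked": 100}, {"deal": "B", "booked": 250}]
bookings_version = store.save("bookings", bookings)

# Step 2: blend with revenue and store the result under a separate name.
revenue = {"A": 80, "B": 250}  # hypothetical revenue per deal
blended = [{**row, "revenue": revenue[row["deal"]]} for row in bookings]
blended_version = store.save("revenue versus booking", blended)

# Later edits to the working data do not affect the stored snapshot.
bookings[0]["booked"] = 0
restored = store.load("bookings", bookings_version)
print(restored[0]["booked"])  # 100
```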

6 DO YOU HAVE A HIGH RATIO OF BUSINESS SMES TO DATA ENGINEERS?

Historically, data preparation has been within the purview of IT teams. This is
partly a legacy issue, as the tools and techniques that were created in the
1990s were built for developers with technical skills. Today, this paradigm is
shifting.

The reality is that for every data engineer or technical resource in a given
organization, there are potentially 1000+ business analysts, all of whom want
more information to perform their jobs better. Therefore, treating data
preparation as an IT task inherently creates a bottleneck.

The context of the data remains within the line of business. For instance, a
supply chain manager would be aware that “Govis Pharmaceuticals” and “Novis
Pharmaceuticals” are in fact the same vendor, and the different entries are
purely recording mistakes. IT would not have knowledge of that context.

So, for scenarios where business knowledge is critical in preparing data, or
scenarios where the number of analysts outweighs the number of technical
resources, a business-oriented data preparation tool makes the most sense.

BUSINESS SCENARIOS
- Weekly sales reporting
- Supply chain / inventory management
- IoT device usage data analysis
- Call center data prep

7 DO YOU NEED TO CREATE GOVERNED BUSINESS USER SANDBOXES FOR YOUR DATA LAKE?

In addition to scenarios where a business team wants to take a hands-on approach
to its data preparation as described in section 6, there are scenarios where IT
desires to provide a governed or contained environment for its business users.

This could be a data lake type of scenario. In order to unlock the value of this
data, IT needs to section off parts of the lake to business teams.

Unfortunately, data lakes provide limited data exploration capabilities for
business users. They require familiarity with SQL or programming languages.
Additionally, the performance of data lakes is often poor, and querying directly
from the lake creates latency and a less-than-ideal user experience.

A sophisticated data prep solution that provides an easy-to-use interface for
business users to interact with data – ideally, at the speed of thought – is
critical to improving the data lake’s business value.

BUSINESS SCENARIOS
- Data lake exploration
- Business use case development and prototyping
- Ad hoc analysis of product usage (e.g. IoT device usage)

8 ARE YOU OPERATING IN A MULTI-REGION, MULTI-DEPARTMENT, OR MULTI-CLIENT ENVIRONMENT?

Enterprise-ready data prep solutions have advanced multi-tenancy capabilities
created to meet the needs of diverse business entities and their SLAs.

For example, a global sales and marketing organization might need to create
individual tenants for data prep projects across corporate and regional teams.
While each tenant has full separation of data and functional privileges (e.g.,
creating a project versus only viewing it), some individuals may need to access
multiple tenants.

Now reverse the scenario. There might be a variety of types and sources of
authentication servers. For example, a US team may use a series of LDAP servers,
whereas the UK team has its own separate, SAML-based authentication framework,
but all teams across the globe must access a shared tenant.

A basic data prep tool cannot serve these many layers of authentication and
authorization complexity, while an enterprise one can.

BUSINESS SCENARIOS
- Marketing operations accessing sales and support tenants for insights
- All individuals across various LDAP / SAML authentication servers accessing a
  common tenant (e.g. AML tenant)
- Never-expiring service accounts to access tenants
- Consultants and OEMs who provide data prep services to their customers
  accessing all of their customer tenants using a single ID and password

9 DO YOU HAVE A MULTI-CLOUD STRATEGY?

Many companies today are avoiding the vendor lock-in situations that occurred a
couple of decades ago with the titans of the enterprise software industry. These
companies are considering a hybrid environment – using a mixture of various
cloud environments and in-house systems.

Choosing cloud-agnostic data prep software that can run across multiple cloud
and on-premises environments not only reduces vendor lock-in, it also increases
the interoperability that is required for fast migrations of workloads from one
environment to the other.

It ensures that data prep can live close to where the data is at rest, and also
enables integration and movement of data across cloud and on-premises sources.

BUSINESS SCENARIOS
- Regional / localized cloud services investments
- Hybrid cloud and on-premises environments to segregate sensitive and
  non-sensitive data
- Choosing the right cloud infrastructure provider for the right application or
  workload

10 DO YOU HAVE A VARIETY OF STAKEHOLDERS WHO WANT TO PARTICIPATE IN DATA PREPARATION PROJECTS?

Today, data projects are no longer in the hands of one developer or one
technical resource. A data project is often conducted by multiple parties,
including analysts, decision-makers, reporting and analytics teams, data science
groups, and others.

Therefore, the traditional approach of creating data projects and publishing
results for others to see and provide feedback on slows the process; it is a
legacy, waterfall approach.

A modern data prep application provides a real time, inline, multi-user
experience, with simultaneous editing and immediate feedback, so that all
parties involved can be part of the preparation project. As a matter of fact,
the style of collaboration should be similar to that of applications such as
Google Sheets or Google Docs.

Workflow style data prep solutions don’t offer such a fluid experience. If
multi-user collaboration is a critical requirement for you, look for modern data
prep solutions that offer this seamless user experience.

BUSINESS SCENARIOS
- Cross-team projects (e.g. sales, support, and marketing)
- Data preparation projects with large groups of analysts
- Real time business and IT collaborations
- Information as a service projects and exchanges between a vendor who is
  preparing data for its clients

CLOSING THOUGHTS

Choosing the right data prep solution is not easy. While on the surface everything
sounds the same, you now know that not all tools are created equal. You need to
bear in mind your business scenarios, types of users, and internal and external
requirements before selecting the right data prep solution that is tailored to your
organization’s specific needs.

Companies around the globe rely on Paxata to get smart about information. Paxata is
the pioneer that intelligently empowers all business consumers to transform raw data
into ready information, instantly and automatically, with an enterprise-grade, self-service
data preparation application and machine learning platform. Our Adaptive Information
Platform weaves data into an information fabric from any source and any cloud to create
trusted insights. Business consumers use clicks, not code to achieve results in minutes,
not months. With Paxata, Be an Information Inspired Business.

Washington D.C., and Singapore.

Paxata Headquarters 1800 Seaport Boulevard Redwood City, CA 94063 1-855-9-PAXATA paxata.com

© 2018 Paxata, Inc. All Rights Reserved
